Tuning

How to adjust haiku.rag's pipeline for better retrieval and answer quality. For individual setting definitions and defaults, see Configuration.

Pipeline Overview

Documents flow through: chunking → embedding → hybrid search (vector + full-text search, FTS) → reranking → context expansion → LLM generation. Retrieval tuning (chunking through reranking) offers the most leverage: if the LLM never sees the right chunks, no prompt or model change will help.

Tuning Retrieval

Chunking

chunk_size controls the granularity of retrieval. Smaller chunks match queries more precisely but carry less context each; larger chunks provide more surrounding information but dilute relevance signals. On the Wix benchmark, increasing from 256 to 512 tokens raised MAP from 0.43 to 0.45 on plain text — a modest gain that also increases token cost per result. See Processing for configuration.

chunker_type selects between hybrid (default) and hierarchical chunking. Hierarchical chunking preserves the document's heading structure and works better for deeply nested or structured content. See Chunking Strategies.

Embedding Model

Larger embedding models produce better representations at the cost of slower indexing and more storage. The choice of embedding model has a larger impact on retrieval quality than most other settings. See Providers for available options and Benchmarks for real comparisons across models.

Reranking

When configured, a cross-encoder reranker re-scores a candidate pool ten times the requested number of results and returns the top ones. This adds latency but improves precision: on the Wix benchmark, adding mxbai-rerank-base-v2 raised MAP from 0.34 to 0.39 on HTML content. See Search Settings for how reranking integrates with search.
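The oversample-then-rerank pattern described above can be sketched in a few lines. This is a minimal illustration, not haiku.rag's implementation: `search_with_rerank`, `toy_retrieve`, and `toy_score` are hypothetical names, and the word-overlap scorer merely stands in for a real cross-encoder.

```python
from typing import Callable


def search_with_rerank(
    query: str,
    retrieve: Callable[[str, int], list[str]],
    score: Callable[[str, str], float],
    limit: int = 5,
    oversample: int = 10,
) -> list[str]:
    """Fetch limit * oversample candidates, re-score them, keep the top `limit`."""
    candidates = retrieve(query, limit * oversample)
    # sorted() is stable, so ties keep their first-stage order
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:limit]


# Toy stand-ins (hypothetical): a first-stage retriever that returns documents
# in corpus order, and a "reranker" that counts words shared with the query.
corpus = [
    "how to bake bread",
    "tuning chunk size for retrieval",
    "reranking improves retrieval precision",
    "a history of sailing",
]


def toy_retrieve(query: str, k: int) -> list[str]:
    return corpus[:k]


def toy_score(query: str, doc: str) -> float:
    return float(len(set(query.split()) & set(doc.split())))


top = search_with_rerank("retrieval precision", toy_retrieve, toy_score, limit=2)
```

The latency cost comes from calling `score` on every candidate, which is why the pool size matters as much as the reranker model.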

Search Settings

limit controls how many results reach the LLM. A higher limit improves recall but increases token usage. See Search Settings.

Context expansion is automatic and section-aware — search results are expanded to include surrounding content from the same document section. For structured documents, expansion stays within section boundaries and filters noise (footnotes, page headers). For unstructured documents, expansion grows outward until the character budget is filled. max_context_chars caps expansion to prevent context bloat.
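For the unstructured case, growing outward under a character budget might look like the sketch below. It is an assumption-laden illustration: `expand_context` is a hypothetical helper, the alternating before/after order is a guess, and the section-aware behavior for structured documents is not modeled.

```python
def expand_context(chunks: list[str], hit: int, max_context_chars: int) -> str:
    """Grow a window around chunks[hit] until no neighbor fits in the budget.

    The matched chunk is always returned, even if it alone exceeds the budget.
    """
    lo = hi = hit
    total = len(chunks[hit])
    while True:
        grew = False
        # Try the preceding chunk first, then the following one
        if lo > 0 and total + len(chunks[lo - 1]) <= max_context_chars:
            lo -= 1
            total += len(chunks[lo])
            grew = True
        if hi < len(chunks) - 1 and total + len(chunks[hi + 1]) <= max_context_chars:
            hi += 1
            total += len(chunks[hi])
            grew = True
        if not grew:
            break
    return "".join(chunks[lo : hi + 1])


chunks = ["aaaa", "bbbb", "cccc", "dddd", "eeee"]
expanded = expand_context(chunks, hit=2, max_context_chars=12)
```

A smaller budget keeps results focused on the matched chunk; a larger one trades tokens for surrounding context.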

Tuning Generation

Model and temperature selection affect answer quality directly — see Providers for options.

domain_preamble prepends domain context to all agent prompts — including the main agent, skill subagents, and internal agents (QA, research). Use it to describe what the knowledge base contains and clarify domain-specific terminology. For full prompt replacement, set prompts.qa directly. See Prompt Customization.

For automated prompt optimization, see Prompt Optimization (GEPA) below.

What Requires a Rebuild

| Change | Rebuild required? |
| --- | --- |
| chunk_size, chunker_type, chunking_merge_peers | Yes: run haiku-rag rebuild |
| Embedding model | Yes: run haiku-rag rebuild |
| Search settings, reranking, prompts | No |

Measuring Changes

Use the inspector for ad-hoc exploration:

haiku-rag inspect

For systematic measurement, use the evaluations/ workspace, which provides retrieval metrics (MRR, MAP) and LLM-judged QA accuracy via pydantic-evals:

# Run retrieval + QA benchmarks
evaluations run <dataset>

# Skip database rebuild when only changing search/reranking/prompt settings
evaluations run <dataset> --skip-db

# Limit test cases for faster iteration
evaluations run <dataset> --limit 50

See Benchmarks for dataset details, methodology, and baseline results.
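As a reminder of what the two retrieval metrics measure, here is a minimal sketch using their textbook definitions. The function names and toy data are illustrative; the evaluations workspace may compute them differently (for example, with a rank cutoff).

```python
def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Mean of precision@k over the ranks k where a relevant result appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Each entry: (ranked result ids, set of relevant ids) for one query
queries = [
    (["a", "b", "c"], {"a", "c"}),
    (["x", "y", "z"], {"y"}),
]

# MRR and MAP are the per-query values averaged over all queries
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
map_score = sum(average_precision(r, rel) for r, rel in queries) / len(queries)
```

MRR only rewards the first relevant hit, while MAP rewards ranking every relevant result highly, so the two can move in different directions when a setting changes.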

Prompt Optimization (GEPA)

The evaluations optimize command uses GEPA (Genetic-Pareto) to evolve the QA system prompt. It evaluates candidate prompts on minibatches scored by an LLM judge, reflects on failures, proposes mutations, and accepts those that improve the score.
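For intuition, the evaluate, reflect, mutate, accept cycle can be sketched as a simplified greedy loop. This is not the GEPA implementation: it keeps only a single best prompt, and `optimize_prompt`, `score`, and `propose` are hypothetical stand-ins for the optimizer, the LLM judge, and the reflection model.

```python
import random
from typing import Callable, Optional


def optimize_prompt(
    seed_prompt: str,
    cases: list[str],
    score: Callable[[str, str], float],        # stand-in for the LLM judge
    propose: Callable[[str, list[str]], str],  # stand-in for the reflection LLM
    num_candidates: int = 50,
    minibatch_size: int = 4,
    rng: Optional[random.Random] = None,
) -> str:
    if len(cases) < 2:
        raise ValueError("need at least two cases for a train/val split")
    rng = rng or random.Random(0)
    half = len(cases) // 2  # 50/50 train/val split
    train, val = cases[:half], cases[half:]

    def val_score(prompt: str) -> float:
        return sum(score(prompt, c) for c in val) / len(val)

    best, best_val = seed_prompt, val_score(seed_prompt)
    for _ in range(num_candidates):
        # Evaluate the current best prompt on a random training minibatch
        batch = rng.sample(train, min(minibatch_size, len(train)))
        failures = [c for c in batch if score(best, c) < 1.0]
        if not failures:
            continue
        # "Reflect" on the failures and propose a mutated prompt
        candidate = propose(best, failures)
        cand_val = val_score(candidate)
        if cand_val > best_val:  # accept only validated improvements
            best, best_val = candidate, cand_val
    return best


# Toy judge: a case passes when its keyword appears in the prompt
toy_cases = ["cite", "concise", "cite", "concise"]
result = optimize_prompt(
    "Answer the question.",
    toy_cases,
    score=lambda prompt, case: 1.0 if case in prompt else 0.0,
    propose=lambda prompt, failures: prompt + " " + failures[0],
    num_candidates=10,
)
```

Accepting a candidate only when the held-out validation score improves is what keeps the loop from overfitting to a single minibatch.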

# Basic optimization
evaluations optimize wix

# Constrained run
evaluations optimize repliqa --limit 40 --num-candidates 30

# Save result
evaluations optimize wix --output optimized_prompt.txt

| Option | Default | Description |
| --- | --- | --- |
| --limit | all cases | QA cases to use (split 50/50 train/val) |
| --num-candidates | 50 | Number of candidate prompts to evaluate |
| --output | | Save optimized prompt to file |
| --config | auto | haiku.rag YAML config path |
| --db | auto | Database path override |
| --judge-model | config.qa.model | LLM judge as provider:name |
| --reflect-model | config.qa.model | Reflection LLM as provider:name |

Apply the result in your config:

prompts:
  qa: |
    Your optimized prompt text here...

Or programmatically: get_qa_agent(client, config, system_prompt=optimized_prompt).