Benchmarks

We use the repliqa dataset for the evaluation of haiku.rag.

You can run your own evaluations using the script at tests/generate_benchmark_db.py as an example.

Recall

To calculate recall, we load the News Stories from repliqa_3 (1035 documents) and index them. We then run a search using the question field of each row of the dataset and check whether the results include the document that answers the question. Questions whose answer cannot be found in the documents are ignored.

The recall obtained is ~0.79 when the answering document is the top result, rising to ~0.91 for the top 3 results, with the "bare" default settings (Ollama qwen3, mxbai-embed-large embeddings, no reranking).

| Embedding Model | Document in top 1 | Document in top 3 | Reranker |
| --- | --- | --- | --- |
| Ollama / mxbai-embed-large | 0.79 | 0.91 | None |
| Ollama / mxbai-embed-large | 0.90 | 0.95 | mxbai-rerank-base-v2 |
| Ollama / nomic-embed-text-v1.5 | 0.74 | 0.90 | None |
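
For reference, the recall loop can be approximated with a short script along these lines. This is a minimal sketch rather than the actual benchmark: the haiku.rag client API (`HaikuRAG`, `create_document`, `search` returning `(chunk, score)` pairs), the Hugging Face dataset path, and the repliqa column names are assumptions; the real setup lives in tests/generate_benchmark_db.py.

```python
import asyncio

from datasets import load_dataset


async def evaluate_recall(top_k: int = 3) -> float:
    # Assumed import path and client API; adapt to the actual haiku.rag version.
    from haiku.rag.client import HaikuRAG

    # Assumed dataset location and column names (document_topic, document_id,
    # document_extracted, question, answer).
    rows = [
        row
        for row in load_dataset("ServiceNow/repliqa", split="repliqa_3")
        if row["document_topic"] == "News Stories"
    ]

    async with HaikuRAG("benchmark.db") as client:
        # Index each News Story once, keeping the mapping from the dataset's
        # document id to the id haiku.rag assigned on insertion.
        indexed: dict[str, str] = {}
        for row in rows:
            if row["document_id"] not in indexed:
                doc = await client.create_document(content=row["document_extracted"])
                indexed[row["document_id"]] = doc.id

        hits = total = 0
        for row in rows:
            # Skip questions whose answer is not in the documents; the exact
            # marker used by repliqa for these rows is an assumption here.
            if "not found in the document" in row["answer"].lower():
                continue
            total += 1
            results = await client.search(row["question"], limit=top_k)
            retrieved_doc_ids = {chunk.document_id for chunk, _score in results}
            if indexed[row["document_id"]] in retrieved_doc_ids:
                hits += 1

    return hits / total


if __name__ == "__main__":
    print(f"recall@3: {asyncio.run(evaluate_recall()):.2f}")
```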

Question/Answer evaluation

Using the same dataset, we run a QA agent to answer each question and an LLM judge (Ollama qwen3) to evaluate whether the answer is correct. The obtained accuracy is as follows:

| Embedding Model | QA Model | Accuracy | Reranker |
| --- | --- | --- | --- |
| Ollama / mxbai-embed-large | Ollama / qwen3 | 0.85 | None |
| Ollama / mxbai-embed-large | Ollama / qwen3 | 0.87 | mxbai-rerank-base-v2 |
| Ollama / mxbai-embed-large | Ollama / qwen3:0.6b | 0.28 | None |

Note the significant degradation when very small models such as qwen3:0.6b are used.
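
For illustration, the LLM-judge step can be sketched as follows. This is a minimal sketch, not the benchmark code itself: it assumes the Ollama Python client for the qwen3 judge and a hypothetical `answer_question()` placeholder standing in for the haiku.rag QA agent; see tests/generate_benchmark_db.py for the actual implementation.

```python
from ollama import AsyncClient


async def answer_question(question: str) -> str:
    """Hypothetical placeholder: replace with the haiku.rag QA agent."""
    raise NotImplementedError


async def judge_answer(question: str, expected: str, answer: str) -> bool:
    """Ask the qwen3 judge whether the QA agent's answer matches the expected one."""
    prompt = (
        "You are grading a question-answering system.\n"
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Given answer: {answer}\n"
        "Reply with a single word: CORRECT or INCORRECT."
    )
    response = await AsyncClient().chat(
        model="qwen3", messages=[{"role": "user", "content": prompt}]
    )
    verdict = response["message"]["content"].upper()
    # "INCORRECT" contains "CORRECT", so check for the negative verdict first.
    return "INCORRECT" not in verdict and "CORRECT" in verdict


async def evaluate_qa(rows) -> float:
    correct = 0
    for row in rows:
        answer = await answer_question(row["question"])
        if await judge_answer(row["question"], row["answer"], answer):
            correct += 1
    return correct / len(rows)
```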