# Benchmarks

We use the repliqa dataset to evaluate haiku.rag.

You can run your own evaluations using the script at tests/generate_benchmark_db.py as a starting point.

## Recall

To calculate recall, we load the News Stories from repliqa_3 (1,035 documents) and index them in a SQLite database. We then run a search on the question field of each row of the dataset and check whether the returned documents include the one that answers the question.
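For reference, a minimal sketch of the recall computation itself, assuming you have already collected, for each question, the id of the answering document and the ordered ids of the retrieved documents (the names and data below are hypothetical, not taken from the benchmark script):

```python
def recall_at_k(results: list[tuple[str, list[str]]], k: int) -> float:
    """Fraction of questions whose answering document appears in the top-k results.

    Each item pairs the id of the document that answers the question
    with the ordered ids of the documents returned by the search.
    """
    hits = sum(1 for answer_id, retrieved in results if answer_id in retrieved[:k])
    return hits / len(results)


# Hypothetical retrieval output: (answering document id, retrieved document ids)
results = [
    ("doc-17", ["doc-17", "doc-4", "doc-99"]),  # hit at rank 1
    ("doc-42", ["doc-3", "doc-42", "doc-8"]),   # hit at rank 2
    ("doc-7", ["doc-1", "doc-2", "doc-3"]),     # miss
]
print(recall_at_k(results, k=1))  # 0.33...
print(recall_at_k(results, k=3))  # 0.66...
```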

The recall obtained is ~0.74–0.77 for the top result, rising to ~0.88–0.89 for the top 3 results, depending on the embedding model:

| Embedding model | Recall (top 1) | Recall (top 3) |
|---|---|---|
| Ollama / mxbai-embed-large | 0.77 | 0.89 |
| Ollama / nomic-embed-text | 0.74 | 0.88 |
| OpenAI / text-embedding-3-small | 0.75 | 0.88 |

## Question/Answer evaluation

Using the same dataset, we have a QA agent answer each question. An LLM judge (qwen3 via Ollama) then evaluates whether the answer is correct. The accuracy obtained is as follows:

| Embedding model | QA model | Accuracy |
|---|---|---|
| Ollama / mxbai-embed-large | Ollama / qwen3 | 0.64 |
| Ollama / mxbai-embed-large | Anthropic / Claude 3.7 Sonnet | 0.79 |
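For illustration, here is a minimal sketch of how such a judge can be wired up with the Ollama Python client. The prompt wording, the CORRECT/INCORRECT parsing, and the field names are assumptions made for the sketch, not the actual implementation in the benchmark script:

```python
import re

import ollama

# Hypothetical judge prompt; the real wording lives in the benchmark script.
JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Expected answer: {expected}
Given answer: {given}
Reply with exactly one word: CORRECT or INCORRECT."""


def judge(question: str, expected: str, given: str) -> bool:
    """Ask the judge model whether the given answer matches the expected one."""
    response = ollama.chat(
        model="qwen3",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, expected=expected, given=given
            ),
        }],
    )
    content = response["message"]["content"]
    # qwen3 may emit a <think>...</think> block first; drop it before parsing.
    verdict = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)
    verdict = verdict.strip().upper()
    # "INCORRECT" contains "CORRECT", so check the negative label first.
    return "INCORRECT" not in verdict and "CORRECT" in verdict


def accuracy(examples: list[dict]) -> float:
    """Fraction of answers the judge marks correct."""
    correct = sum(
        judge(ex["question"], ex["expected"], ex["answer"]) for ex in examples
    )
    return correct / len(examples)
```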