Database and Storage
Local Storage
By default, haiku.rag uses a local LanceDB database:
```yaml
storage:
  data_dir: /path/to/data  # Empty = use default platform location
  vacuum_retention_seconds: 86400  # Cleanup threshold in seconds
```
- data_dir: Directory for local database storage. When empty, a platform-specific default location is used
- vacuum_retention_seconds: When documents are added or updated, table versions older than this threshold are removed. Default: 86400 seconds (1 day, safe for concurrent connections). Set to 0 for aggressive cleanup, which removes all old versions immediately
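For example, a single-process deployment where no other connection could be reading an older table version can opt into aggressive cleanup:

```yaml
storage:
  vacuum_retention_seconds: 0  # remove all old table versions immediately
```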
Remote Storage
For remote storage, use the lancedb settings with various backends:
```yaml
# LanceDB Cloud
lancedb:
  uri: db://your-database-name
  api_key: your-api-key
  region: us-west-2  # optional
```

```yaml
# Amazon S3
lancedb:
  uri: s3://my-bucket/my-table
  # Use AWS credentials or IAM roles
```

```yaml
# Azure Blob Storage
lancedb:
  uri: az://my-container/my-table
  # Use Azure credentials
```

```yaml
# Google Cloud Storage
lancedb:
  uri: gs://my-bucket/my-table
  # Use GCP credentials
```

```yaml
# HDFS
lancedb:
  uri: hdfs://namenode:port/path/to/table
```
Authentication is handled through standard cloud provider credentials (AWS CLI, Azure CLI, gcloud, etc.) or by setting api_key for LanceDB Cloud.
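The storage backend is selected purely by the URI scheme. A small helper (hypothetical, not part of haiku.rag) illustrates the mapping:

```python
from urllib.parse import urlparse

# Maps a LanceDB URI scheme to the backend it selects
BACKENDS = {
    "db": "LanceDB Cloud",
    "s3": "Amazon S3",
    "az": "Azure Blob Storage",
    "gs": "Google Cloud Storage",
    "hdfs": "HDFS",
}

def backend_for(uri: str) -> str:
    scheme = urlparse(uri).scheme
    # A plain path (no scheme) means local LanceDB storage
    return BACKENDS.get(scheme, "local directory")

print(backend_for("s3://my-bucket/my-table"))   # Amazon S3
print(backend_for("db://your-database-name"))   # LanceDB Cloud
```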
Note: For LanceDB Cloud (db:// URIs), table optimization is handled automatically on the server side, so local optimization is disabled for better performance. For object storage backends (S3, Azure, GCS), optimization is still performed locally.
Database Auto-creation
haiku.rag intelligently handles database creation based on operation type:
- Write operations (add, add-src, delete, rebuild): Automatically create the database and required tables if they don't exist
- Read operations (list, get, search, ask, research): Fail with a clear error if the database doesn't exist
This prevents the common mistake where a search query accidentally creates an empty database. To initialize your database, simply add your first document using haiku-rag add or haiku-rag add-src.
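A minimal sketch of this gating, assuming a filesystem-backed database directory (the function and names are illustrative, not haiku.rag's actual internals):

```python
import os

# Operations that may create the database if it is missing
WRITE_OPS = {"add", "add-src", "delete", "rebuild"}

def open_database(path: str, operation: str) -> str:
    """Auto-create the database for write operations; fail fast for reads."""
    if not os.path.isdir(path):
        if operation in WRITE_OPS:
            os.makedirs(path)  # write operations bootstrap the database
        else:
            # read operations must not silently create an empty database
            raise FileNotFoundError(
                f"No database at {path}; add a document first to create it."
            )
    return path
```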
Vector Indexing
Configure vector indexing behavior for efficient similarity search:
```yaml
search:
  vector_index_metric: cosine  # cosine, l2, or dot
  vector_refine_factor: 30  # Re-ranking factor for accuracy
```
- vector_index_metric: Distance metric for vector similarity:
  - cosine: Cosine similarity (default, best for most embeddings)
  - l2: Euclidean distance
  - dot: Dot product similarity
- vector_refine_factor: Improves accuracy when using a vector index by retrieving refine_factor * limit candidates (using approximate search) and re-ranking them with exact distances. Higher values increase accuracy but slow down queries. Default: 30. Only applies with a vector index; has no effect on brute-force search, which already returns exact results
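As a toy illustration of the refine step (plain Python lists stand in for LanceDB's candidate sets; the helper is hypothetical):

```python
import math

def refine(query, approx_candidates, limit, refine_factor=30):
    """Re-rank approximate candidates with exact distances.

    `approx_candidates` plays the role of the ANN result list:
    refine_factor * limit rows are fetched, then the top `limit`
    rows by exact L2 distance are kept.
    """
    pool = approx_candidates[: refine_factor * limit]
    return sorted(pool, key=lambda vec: math.dist(query, vec))[:limit]
```

With limit=10 and the default refine_factor of 30, each query scans and re-ranks 300 candidates.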
Note
Vector indexes are only necessary for large datasets with over 100,000 chunks. For smaller datasets, LanceDB's brute-force kNN search provides exact results with good performance. Only create an index if you notice search performance degradation on large datasets.
Index creation:
Vector indexes are not created automatically during document ingestion to avoid slowing down the process. After you've added documents (at least 256 chunks required), create the index manually:
This command:

- Checks if you have enough data (minimum 256 chunks)
- Creates an IVF_PQ index for fast approximate nearest neighbor (ANN) search
- Uses LanceDB's automatic parameter calculation based on your dataset size and vector dimensions
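To get an intuition for what automatic parameter calculation produces, here is a common rule of thumb for IVF_PQ sizing (this is an approximation for illustration, not LanceDB's exact internal heuristic):

```python
import math

def suggest_ivf_pq_params(num_rows: int, dim: int):
    # Rule of thumb: roughly sqrt(n) IVF partitions, and PQ
    # sub-vectors that evenly divide the vector dimension
    # (here, 16 dimensions per sub-vector).
    num_partitions = max(1, round(math.sqrt(num_rows)))
    num_sub_vectors = max(1, dim // 16)
    return num_partitions, num_sub_vectors

print(suggest_ivf_pq_params(100_000, 768))  # (316, 48)
```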
Re-indexing:
Indexes are not automatically updated when you add new documents. After adding a significant amount of new data:
Searches still work with stale indexes - LanceDB uses the index for old data (fast ANN) and brute-force kNN for new unindexed rows, then combines the results. However, performance degrades as more unindexed data accumulates.
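A toy sketch of that combined scan (illustrative only; here both halves use exact kNN, whereas LanceDB uses fast ANN on the indexed portion):

```python
import math

def knn(query, rows, limit):
    """Exact brute-force kNN over (doc_id, vector) pairs."""
    hits = [(math.dist(query, vec), doc_id) for doc_id, vec in rows]
    return sorted(hits)[:limit]

def search(query, indexed_rows, unindexed_rows, limit):
    # The index covers old rows; rows added since the last re-index
    # are scanned brute-force, and the two result sets are merged.
    hits = knn(query, indexed_rows, limit) + knn(query, unindexed_rows, limit)
    return sorted(hits)[:limit]
```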
For datasets with fewer than 256 chunks, searches use brute-force kNN scans (exact nearest neighbors, 100% recall) which work well for small datasets but don't scale beyond a few hundred thousand vectors.