Document Processing & Monitoring
This guide covers how haiku.rag converts, chunks, and monitors documents.
Document Processing
Configure how documents are converted and chunked:
```yaml
processing:
  # Chunking configuration
  chunk_size: 256              # Maximum tokens per chunk
  context_chunk_radius: 0      # Context radius for chunk expansion
  markdown_preprocessor: ""    # Optional preprocessor script

  # Converter selection
  converter: docling-local     # docling-local or docling-serve

  # Chunker selection and configuration
  chunker: docling-local       # docling-local or docling-serve
  chunker_type: hybrid         # hybrid or hierarchical
  chunking_tokenizer: "Qwen/Qwen3-Embedding-0.6B"  # HuggingFace model for tokenization
  chunking_merge_peers: true   # Merge undersized successive chunks
  chunking_use_markdown_tables: false  # Use markdown tables vs narrative format

  # Conversion options (works with both local and remote converters)
  conversion_options:
    # OCR settings
    do_ocr: true               # Enable OCR for bitmap content
    force_ocr: false           # Replace existing text with OCR
    ocr_lang: []               # OCR languages (e.g., ["en", "fr", "de"])

    # Table extraction
    do_table_structure: true   # Extract table structure
    table_mode: accurate       # fast or accurate
    table_cell_matching: true  # Match table cells back to PDF cells

    # Image settings
    images_scale: 2.0          # Image scale factor
```
Conversion Options
The conversion_options section allows fine-grained control over document conversion. These options work with both docling-local and docling-serve converters.
OCR Settings
```yaml
conversion_options:
  do_ocr: true      # Enable OCR for bitmap/scanned content
  force_ocr: false  # Replace all text with OCR output
  ocr_lang: []      # List of OCR languages, e.g., ["en", "fr", "de"]
```
- `do_ocr`: When `true`, applies OCR to images and scanned pages. Disable for faster processing if documents contain only native text.
- `force_ocr`: When `true`, replaces existing text layers with OCR output. Useful for documents with poor text extraction.
- `ocr_lang`: List of language codes for OCR. An empty list uses default language detection. Examples: `["en"]`, `["en", "fr", "de"]`.
Table Extraction
```yaml
conversion_options:
  do_table_structure: true   # Extract structured table data
  table_mode: accurate       # fast or accurate
  table_cell_matching: true  # Match cells back to PDF
```
- `do_table_structure`: When `true`, extracts table structure. Disable for faster processing if tables aren't important.
- `table_mode`:
    - `accurate`: Better table structure recognition (slower)
    - `fast`: Faster processing with simpler table detection
- `table_cell_matching`: When `true`, matches detected table cells back to PDF cells. Disable if tables have merged cells across columns.
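If tables matter but throughput is the priority, a configuration along these lines trades recognition accuracy for speed. It is a sketch using only the options documented above:

```yaml
processing:
  conversion_options:
    do_table_structure: true
    table_mode: fast           # Simpler, faster table detection
    table_cell_matching: true
```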
Image Settings
- `images_scale`: Scale factor for extracted images. Higher values give better quality but larger output. Typical range: 1.0-3.0.
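As a one-line sketch, raising the scale factor favors image quality over output size:

```yaml
processing:
  conversion_options:
    images_scale: 3.0  # Upper end of the typical range; larger output files
```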
Local vs Remote Processing
Local processing (default):
- Uses the `docling` library locally
- No external dependencies
- Good for development and small workloads
Remote processing (docling-serve):
- Offloads processing to docling-serve API
- Better for heavy workloads and production
- Requires docling-serve instance (see Remote processing setup)
To use remote processing:
```yaml
processing:
  converter: docling-serve
  chunker: docling-serve

providers:
  docling_serve:
    base_url: http://localhost:5001
    api_key: "your-api-key"  # Optional
    timeout: 300             # Request timeout in seconds
```
Conversion options work identically for both local and remote processing.
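For instance, a remote setup can carry the same conversion options shown earlier; this sketch only combines settings already documented above:

```yaml
processing:
  converter: docling-serve
  chunker: docling-serve
  conversion_options:
    do_ocr: true
    table_mode: accurate

providers:
  docling_serve:
    base_url: http://localhost:5001
```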
Chunking Strategies
Hybrid chunking (default):

- Structure-aware chunking
- Respects document boundaries
- Best for most use cases

Hierarchical chunking:

- Creates hierarchical chunk structure
- Preserves document hierarchy
- Useful for complex documents
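To switch strategies, set `chunker_type` under `processing`. A minimal sketch selecting hierarchical chunking:

```yaml
processing:
  chunker_type: hierarchical  # Default is hybrid
```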
Table Serialization
Control how tables are represented in chunks with `chunking_use_markdown_tables`:

- `false`: Tables as narrative text ("Value A, Column 2 = Value B")
- `true`: Tables as markdown (preserves table structure)
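For example, to preserve table structure as markdown in chunks (a minimal sketch):

```yaml
processing:
  chunking_use_markdown_tables: true
```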
Chunk Size and Context
```yaml
processing:
  # Chunk size for document processing
  chunk_size: 256

  # Number of adjacent chunks to include before/after retrieved chunks for context
  # 0 = no expansion (default), 1 = include 1 chunk before and after, etc.
  # When expanded chunks overlap or are adjacent, they are automatically merged
  # into single chunks with continuous content to eliminate duplication
  context_chunk_radius: 0
```
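As a sketch, expanding each retrieved chunk by one neighbor on either side looks like this:

```yaml
processing:
  chunk_size: 256
  context_chunk_radius: 1  # Include 1 chunk before and after each retrieved chunk
```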
Markdown Preprocessor
Optionally preprocess Markdown before chunking by pointing to a callable that receives and returns Markdown text. This is useful for normalizing content, stripping boilerplate, or applying custom transformations before chunk boundaries are computed.
```yaml
processing:
  # A callable path in one of these formats:
  # - package.module:func
  # - package.module.func
  # - /abs/or/relative/path/to/file.py:func
  markdown_preprocessor: my_pkg.preprocess:clean_md
```
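The file-path form works the same way; this sketch assumes a hypothetical local script path:

```yaml
processing:
  markdown_preprocessor: ./scripts/preprocess.py:clean_md  # Hypothetical path
```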
Note
- The function signature should be `def clean_md(text: str) -> str` or `async def clean_md(text: str) -> str`.
- If the function raises or returns a non-string, haiku.rag logs a warning and proceeds without preprocessing.
- The preprocessor affects only the chunking pipeline. The stored document content remains unchanged.
Example implementation:
```python
# my_pkg/preprocess.py
def clean_md(text: str) -> str:
    # Strip HTML comments and collapse multiple blank lines
    lines = [line for line in text.splitlines() if not line.strip().startswith("<!--")]
    out = []
    for line in lines:
        if line.strip() == "" and (out and out[-1] == ""):
            continue
        out.append(line)
    return "\n".join(out)
```
File Monitoring
Set directories to monitor for automatic indexing:
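For example, a minimal configuration (the directory path is a placeholder):

```yaml
monitor:
  directories:
    - /path/to/documents
```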
Filtering Monitored Files
Use gitignore-style patterns to control which files are monitored:
```yaml
monitor:
  directories:
    - /path/to/documents

  # Exclude specific files or directories
  ignore_patterns:
    - "*draft*"        # Ignore files with "draft" in the name
    - "temp/"          # Ignore temp directory
    - "**/archive/**"  # Ignore all archive directories
    - "*.backup"       # Ignore backup files

  # Only include specific files (whitelist mode)
  include_patterns:
    - "*.md"           # Only markdown files
    - "*.pdf"          # Only PDF files
    - "**/docs/**"     # Only files in docs directories
```
How patterns work:
- Extension filtering - Only supported file types are considered
- Include patterns - If specified, only matching files are included (whitelist)
- Ignore patterns - Matching files are excluded (blacklist)
- Combining both - Include patterns are applied first, then ignore patterns
Common patterns:
```yaml
# Only monitor markdown documentation, but ignore drafts
monitor:
  include_patterns:
    - "*.md"
  ignore_patterns:
    - "*draft*"
    - "*WIP*"
```

```yaml
# Monitor all supported files except in specific directories
monitor:
  ignore_patterns:
    - "node_modules/"
    - ".git/"
    - "**/test/**"
    - "**/temp/**"
```
Patterns follow gitignore syntax:
- `*` matches anything except `/`
- `**` matches zero or more directories
- `?` matches any single character
- `[abc]` matches any character in the set