# Document Processing & Monitoring
This guide covers how haiku.rag converts, chunks, and monitors documents.
## Document Processing
Configure how documents are converted and chunked:
```yaml
processing:
  # Chunking configuration
  chunk_size: 256  # Maximum tokens per chunk

  # Converter selection
  converter: docling-local  # docling-local or docling-serve

  # Chunker selection and configuration
  chunker: docling-local  # docling-local or docling-serve
  chunker_type: hybrid  # hybrid or hierarchical
  chunking_tokenizer: "Qwen/Qwen3-Embedding-0.6B"  # HuggingFace model for tokenization
  chunking_merge_peers: true  # Merge undersized successive chunks
  chunking_use_markdown_tables: false  # Use markdown tables vs narrative format

  # Conversion options (works with both local and remote converters)
  conversion_options:
    # OCR settings
    do_ocr: true  # Enable OCR for bitmap content
    force_ocr: false  # Replace existing text with OCR
    ocr_lang: []  # OCR languages (e.g., ["en", "fr", "de"])

    # Table extraction
    do_table_structure: true  # Extract table structure
    table_mode: accurate  # fast or accurate
    table_cell_matching: true  # Match table cells back to PDF cells

    # Image settings
    images_scale: 2.0  # Image scale factor
    generate_page_images: true  # Include rendered page images (for visualize_chunk)
    generate_picture_images: false  # Include embedded figure/diagram images

    # VLM picture description (optional)
    picture_description:
      enabled: false  # Enable VLM image descriptions
      model:
        provider: ollama
        name: ministral-3
```
## Conversion Options

The `conversion_options` section allows fine-grained control over document conversion. These options work with both `docling-local` and `docling-serve` converters.
### OCR Settings

```yaml
conversion_options:
  do_ocr: true  # Enable OCR for bitmap/scanned content
  force_ocr: false  # Replace all text with OCR output
  ocr_lang: []  # List of OCR languages, e.g., ["en", "fr", "de"]
```

- `do_ocr`: When `true`, applies OCR to images and scanned pages. Disable for faster processing if documents contain only native text.
- `force_ocr`: When `true`, replaces existing text layers with OCR output. Useful for documents with poor text extraction.
- `ocr_lang`: List of language codes for OCR. An empty list uses default language detection. Examples: `["en"]`, `["en", "fr", "de"]`.
### Table Extraction

```yaml
conversion_options:
  do_table_structure: true  # Extract structured table data
  table_mode: accurate  # fast or accurate
  table_cell_matching: true  # Match cells back to PDF
```

- `do_table_structure`: When `true`, extracts table structure. Disable for faster processing if tables aren't important.
- `table_mode`:
  - `accurate`: Better table structure recognition (slower)
  - `fast`: Faster processing with simpler table detection
- `table_cell_matching`: When `true`, matches detected table cells back to PDF cells. Disable if tables have merged cells across columns.
### Image Settings

```yaml
conversion_options:
  images_scale: 2.0  # Image resolution scale factor
  generate_page_images: true  # Include rendered page images
  generate_picture_images: false  # Include embedded figure/diagram images
```

- `images_scale`: Scale factor for extracted images. Higher values give better quality but larger size. Typical range: 1.0-3.0.
- `generate_page_images`: When `true` (default), rendered images of each PDF page are included in the document. Required for `visualize_chunk()` to show visual grounding. When `false`, page images are excluded to reduce document size.
- `generate_picture_images`: When `true`, embedded images (figures, diagrams) are included as base64-encoded data in the document. When `false` (default), images are excluded to reduce chunk size and avoid context bloat.

Note: With docling-serve, `generate_picture_images` has limited support: picture image data may not be returned in the JSON response. Page images work correctly with both local and remote converters.
### Picture Description (VLM)
Use a Vision Language Model (VLM) to automatically describe images in documents. Descriptions become searchable text, improving RAG retrieval for visual content.
```yaml
conversion_options:
  picture_description:
    enabled: true  # Enable VLM picture description
    model:
      provider: ollama  # ollama, openai, or custom
      name: ministral-3  # VLM model name
      timeout: 90  # Request timeout in seconds
      max_tokens: 200  # Maximum tokens in response
```
Configuration options:

- `enabled`: When `true`, each embedded image is sent to a VLM for description. Requires `generate_picture_images` to be `true` (automatically enabled).
- `model`: Standard model configuration
  - `provider`: `ollama` (default), `openai`, or use `base_url` for custom endpoints
  - `name`: Model name (e.g., `ministral-3`, `granite3.2-vision`, `gpt-4-vision`)
  - `base_url`: Optional custom API endpoint for vLLM, LM Studio, etc.
- `timeout`: Request timeout in seconds
- `max_tokens`: Maximum tokens in the VLM response

Note: Requires an OpenAI-compatible `/v1/chat/completions` endpoint. Providers with different API formats (e.g., Anthropic Claude) are not supported.
Default prompt (configured in `prompts.picture_description`):

```
Describe this image for a blind user. State the image type
(screenshot, chart, photo, etc.), what it depicts, any visible text,
and key visual details. Be concise and accurate.
```
To customize the prompt globally:
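A sketch, assuming prompts are set under a top-level `prompts` key (as the `prompts.picture_description` reference above suggests; verify the exact key path for your version):

```yaml
prompts:
  picture_description: |
    Describe this image in detail. Transcribe any visible text,
    summarize charts or diagrams, and note key visual details.
```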
Using with Ollama:
Requires Ollama running with a vision-capable model:
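A minimal sketch, assuming a vision-capable model is already pulled in Ollama (the model name below is illustrative):

```yaml
conversion_options:
  picture_description:
    enabled: true
    model:
      provider: ollama
      name: granite3.2-vision  # any vision-capable Ollama model
```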
Using with vLLM or custom endpoints:
```yaml
conversion_options:
  picture_description:
    enabled: true
    model:
      provider: openai  # Use OpenAI-compatible API format
      name: granite-vision
      base_url: http://my-vllm-server:8000
```
How it works:
- During PDF conversion, docling extracts embedded images
- Each image is sent to the configured VLM for description
- Descriptions are added as annotations on the image
- When exported to markdown, descriptions appear as searchable text
Using with docling-serve:

When using `converter: docling-serve`, the VLM calls are made by the docling-serve instance, not by haiku.rag. You must:

- Set `DOCLING_SERVE_ENABLE_REMOTE_SERVICES=true` when running docling-serve
- Ensure the VLM endpoint is accessible from where docling-serve is running

Docker networking: If docling-serve runs in Docker and your VLM runs on the host, use `host.docker.internal` instead of `localhost`:
```yaml
picture_description:
  enabled: true
  model:
    provider: ollama
    name: ministral-3
    base_url: http://host.docker.internal:11434  # NOT localhost!
```
See VLM Picture Description with docling-serve for a complete example.
## Local vs Remote Processing

Local processing (default):

- Uses the `docling` library locally
- No external dependencies
- Good for development and small workloads

Remote processing (docling-serve):

- Offloads processing to the docling-serve API
- Better for heavy workloads and production
- Requires a docling-serve instance (see Remote processing setup)
To use remote processing:
```yaml
processing:
  converter: docling-serve
  chunker: docling-serve

providers:
  docling_serve:
    base_url: http://localhost:5001
    api_key: "your-api-key"  # Optional
```
Conversion options work identically for both local and remote processing.
## Chunking Strategies

Hybrid chunking (default):

- Structure-aware chunking
- Respects document boundaries
- Best for most use cases

Hierarchical chunking:

- Creates hierarchical chunk structure
- Preserves document hierarchy
- Useful for complex documents
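The strategy is selected via `chunker_type` in the `processing` block, for example:

```yaml
processing:
  chunker_type: hierarchical  # default: hybrid
```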
## Table Serialization

Control how tables are represented in chunks:

- `false`: Tables as narrative text ("Value A, Column 2 = Value B")
- `true`: Tables as markdown (preserves table structure)
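This is controlled by `chunking_use_markdown_tables` in the `processing` block:

```yaml
processing:
  chunking_use_markdown_tables: true  # default: false (narrative text)
```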
## Chunk Size
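Chunk size is set via `chunk_size` in the `processing` block, as in the full example above:

```yaml
processing:
  chunk_size: 256  # Maximum tokens per chunk
```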
Context expansion settings (for enriching search results with surrounding content) are configured in the search section. See Search Settings.
## File Monitoring
Set directories to monitor for automatic indexing:
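A minimal configuration, using the same `monitor.directories` key as the filtering examples below:

```yaml
monitor:
  directories:
    - /path/to/documents
```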
### Filtering Monitored Files
Use gitignore-style patterns to control which files are monitored:
```yaml
monitor:
  directories:
    - /path/to/documents

  # Exclude specific files or directories
  ignore_patterns:
    - "*draft*"  # Ignore files with "draft" in the name
    - "temp/"  # Ignore temp directory
    - "**/archive/**"  # Ignore all archive directories
    - "*.backup"  # Ignore backup files

  # Only include specific files (whitelist mode)
  include_patterns:
    - "*.md"  # Only markdown files
    - "*.pdf"  # Only PDF files
    - "**/docs/**"  # Only files in docs directories
```
How patterns work:
- Extension filtering - Only supported file types are considered
- Include patterns - If specified, only matching files are included (whitelist)
- Ignore patterns - Matching files are excluded (blacklist)
- Combining both - Include patterns are applied first, then ignore patterns
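The filtering order above can be sketched in Python. This is a hypothetical helper, not haiku.rag's actual implementation, and `fnmatch` is only an approximation of full gitignore matching (its `*` also crosses `/`):

```python
from fnmatch import fnmatch

SUPPORTED_EXTENSIONS = {".md", ".pdf", ".txt"}  # illustrative subset

def is_monitored(path, include_patterns=None, ignore_patterns=None):
    """Apply extension filtering, then include (whitelist), then ignore (blacklist)."""
    # 1. Extension filtering: only supported file types are considered
    if not any(path.endswith(ext) for ext in SUPPORTED_EXTENSIONS):
        return False
    # 2. Include patterns: if specified, the file must match at least one
    if include_patterns and not any(fnmatch(path, p) for p in include_patterns):
        return False
    # 3. Ignore patterns: the file must match none
    if ignore_patterns and any(fnmatch(path, p) for p in ignore_patterns):
        return False
    return True

print(is_monitored("notes.md", include_patterns=["*.md"], ignore_patterns=["*draft*"]))  # True
print(is_monitored("draft-notes.md", include_patterns=["*.md"], ignore_patterns=["*draft*"]))  # False
print(is_monitored("photo.png"))  # False: unsupported extension
```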
Common patterns:
# Only monitor markdown documentation, but ignore drafts
monitor:
include_patterns:
- "*.md"
ignore_patterns:
- "*draft*"
- "*WIP*"
# Monitor all supported files except in specific directories
monitor:
ignore_patterns:
- "node_modules/"
- ".git/"
- "**/test/**"
- "**/temp/**"
Patterns follow gitignore syntax:

- `*` matches anything except `/`
- `**` matches zero or more directories
- `?` matches any single character
- `[abc]` matches any character in the set