simbarag/docs/VECTORSTORE.md
2026-01-31 17:13:27 -05:00

Vector Store Management

This document describes how to manage the ChromaDB vector store used for RAG (Retrieval-Augmented Generation).

Configuration

The vector store location is controlled by the CHROMADB_PATH environment variable:

  • Development (local): Set in .env to a local path (e.g., /path/to/chromadb)
  • Docker: Automatically set to /app/data/chromadb and persisted via a Docker volume
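
As a minimal sketch of how a script might resolve this setting, the helper below reads CHROMADB_PATH from the environment and falls back to a default directory when the variable is unset or empty. The function name `resolve_chromadb_path` and the `./chromadb_data` fallback are assumptions made for illustration; the project may handle a missing variable differently.

```python
import os
from pathlib import Path

# Hypothetical fallback directory; the project may use another default or none at all.
DEFAULT_CHROMADB_PATH = "./chromadb_data"


def resolve_chromadb_path(env=None):
    """Return the vector store directory named by CHROMADB_PATH."""
    env = os.environ if env is None else env
    raw = env.get("CHROMADB_PATH", "").strip()
    # An unset variable and an empty one are treated the same way.
    return Path(raw or DEFAULT_CHROMADB_PATH).expanduser()


if __name__ == "__main__":
    print(resolve_chromadb_path())
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to check without touching the real environment.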

Management Commands

CLI (Command Line)

Use the scripts/manage_vectorstore.py script for vector store operations:

# Show statistics
python scripts/manage_vectorstore.py stats

# Index documents from Paperless-NGX (incremental)
python scripts/manage_vectorstore.py index

# Clear and reindex all documents
python scripts/manage_vectorstore.py reindex

# List documents
python scripts/manage_vectorstore.py list 10
python scripts/manage_vectorstore.py list 20 --show-content

Docker

Run commands inside the Docker container:

# Show statistics
docker compose exec raggr python scripts/manage_vectorstore.py stats

# Reindex all documents
docker compose exec raggr python scripts/manage_vectorstore.py reindex
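
If the same operations need to be triggered from another Python program, the commands above can be assembled programmatically. The sketch below only builds the argument list; `build_command` is a hypothetical helper, not part of the project, and it assumes the script path and the `raggr` service name shown in this document.

```python
import shlex

SCRIPT = "scripts/manage_vectorstore.py"
# Subcommands documented above.
ACTIONS = {"stats", "index", "reindex", "list"}


def build_command(action, *args, docker=False):
    """Build the argv list for a vector store command, locally or in Docker."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    cmd = ["python", SCRIPT, action, *(str(a) for a in args)]
    if docker:
        # Prefix with the Docker Compose invocation used in this document.
        cmd = ["docker", "compose", "exec", "raggr", *cmd]
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_command("list", 20, "--show-content")))
    print(shlex.join(build_command("stats", docker=True)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to actually execute the command.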

API Endpoints

The following authenticated endpoints are available:

  • GET /api/rag/stats - Get vector store statistics
  • POST /api/rag/index - Trigger indexing of new documents
  • POST /api/rag/reindex - Clear and reindex all documents
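
A hedged sketch of calling these endpoints from Python follows. This document does not specify the authentication scheme or the server address, so the bearer-token header, the `build_rag_request` helper, and the example base URL are all assumptions to adapt to the real deployment.

```python
import urllib.request

# HTTP method and path for each endpoint listed above.
ROUTES = {
    "stats": ("GET", "/api/rag/stats"),
    "index": ("POST", "/api/rag/index"),
    "reindex": ("POST", "/api/rag/reindex"),
}


def build_rag_request(base_url, endpoint, token):
    """Build (but do not send) a request for one of the RAG endpoints."""
    method, path = ROUTES[endpoint]
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        method=method,
        # Assumption: the API accepts a bearer token; adjust to the real scheme.
        headers={"Authorization": f"Bearer {token}"},
    )


if __name__ == "__main__":
    req = build_rag_request("http://localhost:8080", "stats", "example-token")
    print(req.get_method(), req.full_url)
```

To send the request, pass it to `urllib.request.urlopen(req)` and read the response body.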

How It Works

  1. Document Fetching: Documents are fetched from Paperless-NGX via the API
  2. Chunking: Documents are split into chunks of ~1000 characters with a 200-character overlap between consecutive chunks
  3. Embedding: Chunks are embedded using OpenAI's text-embedding-3-large model
  4. Storage: Embeddings are stored in ChromaDB with metadata (filename, document type, date)
  5. Retrieval: User queries are embedded and similar chunks are retrieved for RAG
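
The chunking step can be illustrated with a simplified fixed-size splitter. This is a sketch, not the project's implementation: the "~1000" figure suggests the real splitter may break on natural boundaries such as paragraphs or sentences, whereas this version cuts at exact character offsets. The name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into chunks of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # each chunk starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(2500))
    print([len(c) for c in chunk_text(sample)])
```

With the default settings each chunk begins 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.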

Troubleshooting

"Error creating hnsw segment reader"

This error indicates a corrupted index. Clear and rebuild the vector store with a full reindex:

python scripts/manage_vectorstore.py reindex

Empty results

Check if documents are indexed:

python scripts/manage_vectorstore.py stats

If the count is 0, index the documents:

python scripts/manage_vectorstore.py index

Different results in Docker vs local

Docker and local environments use separate ChromaDB instances, so their contents can differ. To bring them in line, do one of the following:

  1. Reindex inside Docker: docker compose exec raggr python scripts/manage_vectorstore.py reindex
  2. Mount the same volume in both environments so they share one ChromaDB instance

Production Considerations

  1. Volume Persistence: Use Docker volumes or persistent storage for ChromaDB
  2. Backup: Regularly back up the ChromaDB data directory
  3. Reindexing: Schedule periodic reindexing to keep data fresh
  4. Monitoring: Monitor the /api/rag/stats endpoint for document counts
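
As one possible approach to the backup item above, the sketch below archives the ChromaDB data directory into a timestamped tarball. The helper name `backup_chromadb` and the archive naming scheme are assumptions. Copying the files while indexing is in progress may capture an inconsistent state, so it is safer to run this when no writes are happening.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def backup_chromadb(data_dir, backup_dir):
    """Archive data_dir as a timestamped .tar.gz in backup_dir and return its path."""
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"ChromaDB directory not found: {data_dir}")
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # UTC timestamp keeps archive names sortable and unambiguous.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = shutil.make_archive(
        str(backup_dir / f"chromadb-{stamp}"), "gztar", root_dir=str(data_dir)
    )
    return Path(archive)


if __name__ == "__main__":
    import os

    # Assumes CHROMADB_PATH is set as described under Configuration.
    print(backup_chromadb(os.environ["CHROMADB_PATH"], "./backups"))
```

Scheduling this with cron or a similar job runner would cover the "regularly" part. The backup directory should live outside the data directory so archives are not included in later backups.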