Vector Store Management
This document describes how to manage the ChromaDB vector store used for RAG (Retrieval-Augmented Generation).
Configuration
The vector store location is controlled by the CHROMADB_PATH environment variable:
- Development (local): Set in .env to a local path (e.g., /path/to/chromadb)
- Docker: Automatically set to /app/data/chromadb and persisted via Docker volume
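As a rough illustration of how a script might pick up this setting, here is a minimal sketch that reads CHROMADB_PATH from the environment. The function name resolve_chromadb_path and the fallback value ./chromadb are assumptions made for this example and are not taken from the project.

```python
import os
from pathlib import Path

# Hypothetical fallback, used only when CHROMADB_PATH is unset (not from the project).
DEFAULT_CHROMADB_PATH = "./chromadb"


def resolve_chromadb_path(env=None):
    """Return the vector store directory named by CHROMADB_PATH as an absolute path."""
    env = os.environ if env is None else env
    # Treat an empty value the same as an unset variable.
    raw = env.get("CHROMADB_PATH") or DEFAULT_CHROMADB_PATH
    return Path(raw).expanduser().resolve()


if __name__ == "__main__":
    print(resolve_chromadb_path())
```

Passing a plain dict instead of os.environ makes it easy to check what each environment would resolve to without changing the real environment.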
Management Commands
CLI (Command Line)
Use the manage_vectorstore.py script for vector store operations:
# Show statistics
python manage_vectorstore.py stats
# Index documents from Paperless-NGX (incremental)
python manage_vectorstore.py index
# Clear and reindex all documents
python manage_vectorstore.py reindex
# List documents
python manage_vectorstore.py list 10
python manage_vectorstore.py list 20 --show-content
Docker
Run commands inside the Docker container:
# Show statistics
docker compose -f docker-compose.dev.yml exec -T raggr python manage_vectorstore.py stats
# Reindex all documents
docker compose -f docker-compose.dev.yml exec -T raggr python manage_vectorstore.py reindex
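If these commands need to be triggered from another Python script (for example a scheduled job), one possible approach is a thin wrapper around subprocess. The helper names docker_manage_cmd and run_in_container are hypothetical; the compose file and service name are the ones used in the commands above.

```python
import subprocess


def docker_manage_cmd(subcommand, *args, compose_file="docker-compose.dev.yml", service="raggr"):
    """Build the argument list for running a manage_vectorstore.py subcommand in the container."""
    return [
        "docker", "compose", "-f", compose_file,
        "exec", "-T", service,
        "python", "manage_vectorstore.py", subcommand, *args,
    ]


def run_in_container(subcommand, *args):
    """Run the command and return its exit code (requires Docker and a running container)."""
    return subprocess.run(docker_manage_cmd(subcommand, *args), check=False).returncode
```

Building the command as a list rather than a single string avoids shell quoting issues when extra arguments are passed.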
API Endpoints
The following authenticated endpoints are available:
- GET /api/rag/stats - Get vector store statistics
- POST /api/rag/index - Trigger indexing of new documents
- POST /api/rag/reindex - Clear and reindex all documents
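The endpoints above could be called from Python roughly as follows. This is a sketch under stated assumptions: the base URL http://localhost:8080 and the bearer-token Authorization header are placeholders, since this document does not specify the host, port, or authentication scheme. Adjust both to match the actual deployment.

```python
import urllib.request

# Assumption: placeholder base URL, not taken from the project.
BASE_URL = "http://localhost:8080"

# Methods and paths as listed in this document.
ENDPOINTS = {
    "stats": ("GET", "/api/rag/stats"),
    "index": ("POST", "/api/rag/index"),
    "reindex": ("POST", "/api/rag/reindex"),
}


def build_request(action, token, base_url=BASE_URL):
    """Build (but do not send) a request for one of the RAG endpoints."""
    method, path = ENDPOINTS[action]
    req = urllib.request.Request(base_url.rstrip("/") + path, method=method)
    # Assumption: bearer-token auth; the real scheme may differ.
    req.add_header("Authorization", f"Bearer {token}")
    return req


def call(action, token, base_url=BASE_URL):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_request(action, token, base_url), timeout=30) as resp:
        return resp.read().decode("utf-8")
```

The response body is returned as raw text because the response schema is not described here.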
How It Works
- Document Fetching: Documents are fetched from Paperless-NGX via the API
- Chunking: Documents are split into chunks of ~1000 characters with a 200-character overlap
- Embedding: Chunks are embedded using OpenAI's text-embedding-3-large model
- Storage: Embeddings are stored in ChromaDB with metadata (filename, document type, date)
- Retrieval: User queries are embedded and similar chunks are retrieved for RAG
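To make the chunking step concrete, here is a minimal sketch of a fixed-width sliding window with the sizes mentioned above (1000 characters, 200 characters of overlap). It is an illustration only: the project's actual splitter may differ, for example by breaking on paragraph or sentence boundaries, which would explain the "~" in the chunk size. The function name chunk_text is hypothetical.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size chunks where each chunk repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With these defaults the window advances 800 characters at a time, so the last 200 characters of each chunk reappear at the start of the next. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.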
Troubleshooting
"Error creating hnsw segment reader"
This indicates a corrupted index. Solution:
python manage_vectorstore.py reindex
Empty results
Check if documents are indexed:
python manage_vectorstore.py stats
If count is 0, run:
python manage_vectorstore.py index
Different results in Docker vs local
Docker and local environments use separate ChromaDB instances. To sync:
- Index inside Docker: docker compose exec -T raggr python manage_vectorstore.py reindex
- Or mount the same volume for both environments
Production Considerations
- Volume Persistence: Use Docker volumes or persistent storage for ChromaDB
- Backup: Regularly back up the ChromaDB data directory
- Reindexing: Schedule periodic reindexing to keep data fresh
- Monitoring: Monitor the /api/rag/stats endpoint for document counts
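For the backup item above, one simple option is to archive the directory that CHROMADB_PATH points to. The sketch below uses only the Python standard library; the function name backup_chromadb and the archive naming scheme are assumptions for illustration, not part of the project.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def backup_chromadb(data_dir, backup_dir):
    """Create a timestamped .tar.gz of the ChromaDB data directory and return its path."""
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"ChromaDB data directory not found: {data_dir}")

    # Resolve to an absolute path so the archive location does not depend on the working directory.
    backup_dir = Path(backup_dir).resolve()
    backup_dir.mkdir(parents=True, exist_ok=True)

    # UTC timestamp keeps archive names unique and sortable.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive_base = backup_dir / f"chromadb-{stamp}"
    return Path(shutil.make_archive(str(archive_base), "gztar", root_dir=str(data_dir)))
```

Keep the backup directory outside the data directory, and consider pausing indexing while the archive is created, since copying files that are being written may capture an inconsistent state.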