reorganization

2026-01-31 17:13:27 -05:00
parent 1fd2e860b2
commit ad39904dda
87 changed files with 1019 additions and 237 deletions
@@ -0,0 +1,97 @@
+# Vector Store Management
+
+This document describes how to manage the ChromaDB vector store used for RAG (Retrieval-Augmented Generation).
+
+## Configuration
+
+The vector store location is controlled by the `CHROMADB_PATH` environment variable:
+
+- **Development (local)**: Set in `.env` to a local path (e.g., `/path/to/chromadb`)
+- **Docker**: Automatically set to `/app/data/chromadb` and persisted via Docker volume
+
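A minimal sketch of how a script might resolve the store location from `CHROMADB_PATH`. The fallback value `./chromadb` and the function name `resolve_chromadb_path` are hypothetical and not taken from the project; only the variable name and the Docker path come from the text above.

```python
import os
from pathlib import Path

# Hypothetical fallback for local development; the project may instead
# require CHROMADB_PATH to be set explicitly.
DEFAULT_LOCAL_PATH = "./chromadb"


def resolve_chromadb_path(env=None):
    """Return the vector store directory named by CHROMADB_PATH.

    `env` defaults to os.environ; passing a plain dict makes the
    function easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    raw = env.get("CHROMADB_PATH", "").strip() or DEFAULT_LOCAL_PATH
    return Path(raw).expanduser()


if __name__ == "__main__":
    print(resolve_chromadb_path())
```

If the project opens the store with ChromaDB's persistent client, the resolved path could be passed as `chromadb.PersistentClient(path=str(resolve_chromadb_path()))`; whether it does so is an assumption.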
+## Management Commands
+
+### CLI
+
+Use the `scripts/manage_vectorstore.py` script for vector store operations:
+
+```bash
+# Show statistics
+python scripts/manage_vectorstore.py stats
+
+# Index documents from Paperless-NGX (incremental)
+python scripts/manage_vectorstore.py index
+
+# Clear and reindex all documents
+python scripts/manage_vectorstore.py reindex
+
+# List documents
+python scripts/manage_vectorstore.py list 10
+python scripts/manage_vectorstore.py list 20 --show-content
+```
+
+### Docker
+
+Run commands inside the Docker container:
+
+```bash
+# Show statistics
+docker compose exec raggr python scripts/manage_vectorstore.py stats
+
+# Reindex all documents
+docker compose exec raggr python scripts/manage_vectorstore.py reindex
+```
+
+### API Endpoints
+
+The following endpoints are available and require authentication:
+
+- `GET /api/rag/stats` - Get vector store statistics
+- `POST /api/rag/index` - Trigger indexing of new documents
+- `POST /api/rag/reindex` - Clear and reindex all documents
+
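The endpoints above could be called from a small client like the sketch below. The document does not say how authentication works, so the bearer token header, the base URL, and the assumption that responses are JSON are all hypothetical; only the three paths and their HTTP methods come from the list.

```python
import json
import urllib.request


def build_rag_request(base_url, path, token, method="GET"):
    """Build a request for one of the /api/rag/* endpoints.

    Assumption: the API accepts a bearer token. Adjust the header if the
    service uses cookies or another scheme.
    """
    url = base_url.rstrip("/") + path
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method=method)


def call_rag_endpoint(base_url, path, token, method="GET", timeout=30):
    """Send the request and decode the body, assuming a JSON response."""
    request = build_rag_request(base_url, path, token, method)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (hypothetical host and token):
#   stats = call_rag_endpoint("http://localhost:8080", "/api/rag/stats", "my-token")
#   call_rag_endpoint("http://localhost:8080", "/api/rag/index", "my-token", method="POST")
```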
+## How It Works
+
+1. **Document Fetching**: Documents are fetched from Paperless-NGX via its API
+2. **Chunking**: Documents are split into chunks of ~1000 characters with a 200-character overlap between consecutive chunks
+3. **Embedding**: Chunks are embedded using OpenAI's `text-embedding-3-large` model
+4. **Storage**: Embeddings are stored in ChromaDB with metadata (filename, document type, date)
+5. **Retrieval**: User queries are embedded and similar chunks are retrieved for RAG
+
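To make steps 2 and 5 concrete, here is a rough sketch of fixed-size chunking with overlap and of ranking chunks against a query vector. It is not the project's implementation: the real splitter may respect sentence or paragraph boundaries, and cosine similarity is used here only as an illustrative metric, since the distance function configured in ChromaDB is not stated. The function names are hypothetical.

```python
import math


def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into windows of chunk_size characters that share
    `overlap` characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunk vectors most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

With the default parameters each window starts 800 characters after the previous one, so a 2500-character document yields three chunks, the last one shorter than the rest.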
+## Troubleshooting
+
+### "Error creating hnsw segment reader"
+
+This error indicates a corrupted index. Rebuild it with a full reindex, which clears the store and indexes all documents again:
+
+```bash
+python scripts/manage_vectorstore.py reindex
+```
+
+### Empty results
+
+Check if documents are indexed:
+
+```bash
+python scripts/manage_vectorstore.py stats
+```
+
+If the count is 0, index the documents:
+
+```bash
+python scripts/manage_vectorstore.py index
+```
+
+### Different results in Docker vs local
+
+Docker and local environments use separate ChromaDB instances, so their contents can differ. To bring them in line, do one of the following:
+
+- Reindex inside Docker: `docker compose exec raggr python scripts/manage_vectorstore.py reindex`
+- Mount the same ChromaDB data directory in both environments
+
+## Production Considerations
+
+1. **Volume Persistence**: Use Docker volumes or persistent storage for ChromaDB
+2. **Backup**: Regularly back up the ChromaDB data directory
+3. **Reindexing**: Schedule periodic reindexing to keep data fresh
+4. **Monitoring**: Monitor the `/api/rag/stats` endpoint for document counts
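
The backup item above could be handled with a small script along these lines. This is a sketch under assumptions: the function name and archive naming scheme are hypothetical, and it assumes nothing is writing to the store while the archive is created (for example, the service is stopped or idle).

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def backup_chromadb(data_dir, backup_dir):
    """Archive the ChromaDB data directory into a timestamped .tar.gz.

    Assumption: no process is writing to data_dir during the copy, and
    backup_dir is located outside data_dir.
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"ChromaDB directory not found: {data_dir}")
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    base_name = backup_dir / f"chromadb-{stamp}"
    archive = shutil.make_archive(str(base_name), "gztar", root_dir=str(data_dir))
    return Path(archive)


# Example usage (paths are illustrative):
#   backup_chromadb("/app/data/chromadb", "/backups")
```

Run from a scheduler such as cron, this produces one archive per invocation; pruning old archives is left out of the sketch.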