linter
97
services/raggr/VECTORSTORE.md
Normal file
@@ -0,0 +1,97 @@
# Vector Store Management

This document describes how to manage the ChromaDB vector store used for RAG (Retrieval-Augmented Generation).

## Configuration

The vector store location is controlled by the `CHROMADB_PATH` environment variable:

- **Development (local)**: Set in `.env` to a local path (e.g., `/path/to/chromadb`)
- **Docker**: Automatically set to `/app/data/chromadb` and persisted via a Docker volume

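
As a minimal sketch of how application code might read this setting: the helper name `resolve_chromadb_path` and the `./chromadb` fallback are assumptions made for illustration, not taken from the service, which may instead require `CHROMADB_PATH` to be set.

```python
import os
from pathlib import Path

# Hypothetical fallback; the real service may require CHROMADB_PATH to be set.
DEFAULT_CHROMADB_PATH = "./chromadb"


def resolve_chromadb_path(env=None) -> Path:
    """Return the vector store directory from CHROMADB_PATH, or the fallback."""
    env = os.environ if env is None else env
    raw = env.get("CHROMADB_PATH") or DEFAULT_CHROMADB_PATH
    # Expand "~" and make the path absolute so logs show the real location.
    return Path(raw).expanduser().resolve()


if __name__ == "__main__":
    print(resolve_chromadb_path())
```

The resulting path is what one would typically hand to `chromadb.PersistentClient(path=...)`. Printing it in both environments is a quick way to confirm which store a process is actually using.
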
## Management Commands

### CLI (Command Line)

Use the `manage_vectorstore.py` script for vector store operations:

```bash
# Show statistics
python manage_vectorstore.py stats

# Index documents from Paperless-NGX (incremental)
python manage_vectorstore.py index

# Clear and reindex all documents
python manage_vectorstore.py reindex

# List documents
python manage_vectorstore.py list 10
python manage_vectorstore.py list 20 --show-content
```

### Docker

Run commands inside the Docker container:

```bash
# Show statistics
docker compose -f docker-compose.dev.yml exec -T raggr python manage_vectorstore.py stats

# Reindex all documents
docker compose -f docker-compose.dev.yml exec -T raggr python manage_vectorstore.py reindex
```

### API Endpoints

The following authenticated endpoints are available:

- `GET /api/rag/stats` - Get vector store statistics
- `POST /api/rag/index` - Trigger indexing of new documents
- `POST /api/rag/reindex` - Clear and reindex all documents

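
The endpoints above could be called from Python roughly as follows. This is a sketch under stated assumptions: the base URL, the bearer-token header, and the JSON response body are placeholders, since this document only says that the endpoints are authenticated. Adjust them to match the service's actual auth scheme.

```python
import json
import urllib.request

# Assumptions: the base URL and bearer-token auth are placeholders; the
# document only states that the endpoints are authenticated.
BASE_URL = "http://localhost:8080"


def build_request(method: str, path: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request to one of the /api/rag endpoints."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        method=method,
        headers={"Authorization": f"Bearer {token}"},
    )


def call(method: str, path: str, token: str):
    """Send the request and decode the body, assuming it is JSON."""
    request = build_request(method, path, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `call("GET", "/api/rag/stats", token)` would fetch the statistics, and `call("POST", "/api/rag/index", token)` would trigger an incremental index.
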
## How It Works

1. **Document Fetching**: Documents are fetched from Paperless-NGX via its API
2. **Chunking**: Documents are split into chunks of ~1000 characters with a 200-character overlap
3. **Embedding**: Chunks are embedded using OpenAI's `text-embedding-3-large` model
4. **Storage**: Embeddings are stored in ChromaDB with metadata (filename, document type, date)
5. **Retrieval**: User queries are embedded, and similar chunks are retrieved for RAG

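
To make the chunking step concrete, here is a simplified sketch that uses fixed-size character windows. The function name `chunk_text` is hypothetical, and the service's actual splitter may differ (the "~" suggests it may break on sentence or paragraph boundaries rather than at exact offsets).

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # with the defaults, windows start 800 chars apart
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With the defaults, a 2500-character document yields three chunks covering characters 0-1000, 800-1800, and 1600-2500. The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk.
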
## Troubleshooting

### "Error creating hnsw segment reader"

This error indicates a corrupted index. To fix it, clear and rebuild the index:

```bash
python manage_vectorstore.py reindex
```

### Empty results

Check whether any documents are indexed:

```bash
python manage_vectorstore.py stats
```

If the document count is 0, index the documents:

```bash
python manage_vectorstore.py index
```

### Different results in Docker vs local

The Docker and local environments use separate ChromaDB instances, so their contents can differ. To sync them, either:

1. Reindex inside Docker: `docker compose exec -T raggr python manage_vectorstore.py reindex`
2. Or mount the same volume for both environments

## Production Considerations

1. **Volume Persistence**: Use Docker volumes or persistent storage for ChromaDB
2. **Backup**: Regularly back up the ChromaDB data directory
3. **Reindexing**: Schedule periodic reindexing to keep data fresh
4. **Monitoring**: Monitor the `/api/rag/stats` endpoint for document counts

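
For the backup item, one possible approach is to archive the data directory with the standard library. This is a sketch only: `backup_chromadb` and the archive naming scheme are hypothetical, and it assumes nothing is writing to the store while the archive is created.

```python
import shutil
import time
from pathlib import Path


def backup_chromadb(data_dir: str, backup_dir: str) -> Path:
    """Archive the ChromaDB data directory into a timestamped .tar.gz file."""
    source = Path(data_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"ChromaDB directory not found: {source}")
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # make_archive appends the ".tar.gz" suffix to the base name itself.
    archive = shutil.make_archive(
        str(target_dir / f"chromadb-{stamp}"), "gztar", root_dir=str(source)
    )
    return Path(archive)
```

Keep `backup_dir` outside the data directory so the archive does not end up inside the tree it is archiving. Since the store can always be rebuilt with `reindex`, a backup mainly saves the time and embedding cost of a full rebuild.
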