- Markdown-aware chunking (split on headers before size-based splitting)
- Prepend note filename to each chunk for self-contained context
- Source-filtered retrieval (obsidian/paperless queries stay isolated)
- MMR search with k=8, fetch_k=24 for better recall and diversity
- Add source metadata to Paperless docs and folder metadata to Obsidian docs
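
A minimal sketch of the first two bullets, assuming plain-Python helpers rather than the project's actual splitter. The names `split_on_headers`, `split_by_size`, and `chunk_note`, and the `max_chars` default, are hypothetical:

```python
import re

# ATX headers only ("# Title" .. "###### Title"); headers inside fenced code
# blocks are not special-cased in this sketch.
HEADER_RE = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)


def split_on_headers(text):
    """Split Markdown into sections, each starting at a header line."""
    starts = [m.start() for m in HEADER_RE.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first header
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]


def split_by_size(section, max_chars):
    """Size-based fallback: pack paragraphs until max_chars is reached.

    A single paragraph longer than max_chars is kept whole.
    """
    chunks, current = [], ""
    for para in section.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def chunk_note(filename, text, max_chars=1000):
    """Header split first, size split second, filename prepended to each chunk."""
    chunks = []
    for section in split_on_headers(text):
        for piece in split_by_size(section, max_chars):
            chunks.append(f"{filename}\n\n{piece}")
    return chunks
```

Splitting on headers first keeps each chunk inside one topic, and the filename prefix lets a chunk be understood without its neighbours.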
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The async/sync engine split caused visibility issues where newly indexed
files weren't found on the next cycle, triggering re-indexing of all 36
files every 60 seconds. Replace the per-cycle lookup with a module-level
dict that loads from the DB on cold start and stays in sync via cache
updates after each indexing pass.
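
A rough sketch of the module-level cache pattern, assuming the cache maps file paths to mtimes. `load_indexed_files_from_db` is a hypothetical stand-in for the real query, and every name here is illustrative:

```python
from typing import Dict, Optional

# path -> mtime of the indexed version; None means "not loaded yet" (cold start).
_indexed_files: Optional[Dict[str, float]] = None


def load_indexed_files_from_db() -> Dict[str, float]:
    """Hypothetical placeholder for the one-time database read."""
    return {}


def get_indexed_files() -> Dict[str, float]:
    """Return the cache, loading it from the DB only on first use."""
    global _indexed_files
    if _indexed_files is None:
        _indexed_files = load_indexed_files_from_db()
    return _indexed_files


def record_indexed(path: str, mtime: float) -> None:
    """Update the cache right after a file has been indexed."""
    get_indexed_files()[path] = mtime


def record_removed(path: str) -> None:
    """Drop a file from the cache after its chunks were deleted."""
    get_indexed_files().pop(path, None)
```

Because the polling loop reads and writes the same in-process dict, the next cycle sees the files it just indexed regardless of which engine wrote them.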
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes:
- Convert wikilinks to display text instead of stripping them entirely.
[[Noah]] becomes "Noah", [[target|display]] becomes "display".
Stripping them made names and references in wikilinks invisible to search.
- Switch _get_obsidian_indexed_files to async engine to avoid stale reads
from the separate sync engine, which caused files to be re-indexed
every cycle.
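
The wikilink conversion from the first fix could look roughly like this. The regex and the function name `wikilinks_to_text` are assumptions, not the committed code:

```python
import re

# [[target]] or [[target|display]]; group 2 is the optional display text.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def wikilinks_to_text(text):
    """Replace each wikilink with its display text, or its target if none is given."""
    return WIKILINK_RE.sub(lambda m: (m.group(2) or m.group(1)).strip(), text)
```

This sketch does not special-case embeds (`![[...]]`) or heading links (`[[Note#Heading]]`); those pass through as their raw target text.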
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Documents left empty after sanitization caused aadd_documents to issue a
DEFAULT VALUES insert. Guard with an emptiness check. Also increase
similarity search k from 2 to 6 so multi-word queries such as full names
have better recall.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
YAML frontmatter can contain datetime objects which aren't JSON
serializable. Add _make_serializable() to coerce all metadata values
before storing in pgvector.
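
One possible shape for such a coercion helper, as a sketch only. The real `_make_serializable()` may handle different types; the name `make_serializable` below is used to keep the two apart:

```python
import json
from datetime import date, datetime, time


def make_serializable(value):
    """Recursively coerce a metadata value into something json.dumps accepts."""
    if isinstance(value, (datetime, date, time)):
        return value.isoformat()
    if isinstance(value, dict):
        return {str(k): make_serializable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [make_serializable(v) for v in value]
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    return str(value)  # last resort for unknown types


# Frontmatter as a YAML parser might return it (values are illustrative).
frontmatter = {"created": datetime(2024, 1, 2, 3, 4, 5), "tags": ["vet", date(2024, 1, 3)]}
metadata_json = json.dumps(make_serializable(frontmatter))
```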
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace full delete-and-reindex with mtime-based incremental sync that
only re-indexes changed/new files and removes deleted ones. A background
polling task keeps the vector store up-to-date automatically when
OBSIDIAN_CONTINUOUS_SYNC=true.
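
The mtime comparison at the core of the incremental sync might be sketched as follows. `scan_vault` and `plan_sync` are hypothetical names, and the sketch assumes the indexed state is available as a path-to-mtime mapping:

```python
from pathlib import Path


def scan_vault(root):
    """Map each Markdown file under root (relative path) to its current mtime."""
    root = Path(root)
    return {str(p.relative_to(root)): p.stat().st_mtime for p in root.rglob("*.md")}


def plan_sync(current, indexed):
    """Return (to_index, to_remove) given current and previously indexed mtimes.

    Any mtime difference counts as a change, so files restored from a backup
    with an older timestamp are picked up as well.
    """
    to_index = sorted(p for p, mtime in current.items() if indexed.get(p) != mtime)
    to_remove = sorted(p for p in indexed if p not in current)
    return to_index, to_remove
```

A polling task would call `scan_vault`, pass the result to `plan_sync`, and then index or delete only the listed paths.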
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Disable tiktoken pre-encoding for custom embedding servers. LangChain
was encoding text into OpenAI token IDs then sending them to llama-server
which has a different vocabulary, causing "invalid tokens" errors.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Indexes chunks one at a time with error logging to identify which
document/chunk causes embedding failures. Also strips Unicode surrogates
and replacement characters.
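
A sketch of the per-chunk approach, assuming an `add_one` callable that embeds and stores a single chunk. That callable, the `(doc_id, text)` tuple shape, and both function names are hypothetical:

```python
import logging
import re

logger = logging.getLogger(__name__)

# Lone UTF-16 surrogates and U+FFFD (the replacement character).
_BAD_UNICODE_RE = re.compile("[\ud800-\udfff\ufffd]")


def strip_bad_unicode(text):
    """Remove surrogates and replacement characters."""
    return _BAD_UNICODE_RE.sub("", text)


def index_chunks_individually(chunks, add_one):
    """Index (doc_id, text) chunks one at a time; return the failures.

    A failure is logged with its document id and chunk position, and the
    loop continues so that one bad chunk does not block the rest.
    """
    failed = []
    for position, (doc_id, text) in enumerate(chunks):
        try:
            add_one(doc_id, strip_bad_unicode(text))
        except Exception:
            logger.exception(
                "Embedding failed for document %r, chunk %d (%d chars)",
                doc_id, position, len(text),
            )
            failed.append((doc_id, position))
    return failed
```

Indexing one chunk per call is slower than batching, but it narrows a failure down to a single chunk.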
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strips null bytes, control characters, and excessive whitespace from
document content before sending to embedding models. Fixes 400 errors
from BERT-based tokenizers (e.g. nomic-embed-text) on PDF-extracted text.
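
The cleanup described here could be approximated as below. The exact character classes and the name `sanitize_text` are assumptions:

```python
import re

# C0 control characters except tab, newline, and carriage return, plus DEL.
_CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_SPACES_RE = re.compile(r"[ \t]+")
_BLANK_LINES_RE = re.compile(r"\n{3,}")


def sanitize_text(text):
    """Strip null bytes and control characters, then collapse whitespace."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = _CONTROL_RE.sub("", text)           # includes \x00
    text = _SPACES_RE.sub(" ", text)           # runs of spaces/tabs -> one space
    text = _BLANK_LINES_RE.sub("\n\n", text)   # at most one blank line in a row
    return text.strip()
```

Input that consists only of debris comes back as an empty string, so callers may want to skip empty results before embedding.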
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds EMBEDDING_SERVER_URL and EMBEDDING_MODEL_NAME env vars, mirroring
the existing LLAMA_SERVER_URL pattern for LLM configuration.
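
Reading the two variables might look like this. The `EmbeddingConfig` class, the `load_embedding_config` helper, and the fallback model name are all hypothetical; only the variable names come from the commit:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EmbeddingConfig:
    server_url: Optional[str]  # None means "no custom embedding server configured"
    model_name: str


def load_embedding_config(env: Optional[Mapping[str, str]] = None) -> EmbeddingConfig:
    """Build the config from environment variables (or a supplied mapping)."""
    env = os.environ if env is None else env
    return EmbeddingConfig(
        server_url=env.get("EMBEDDING_SERVER_URL") or None,
        # The fallback below is a placeholder, not the project's real default.
        model_name=env.get("EMBEDDING_MODEL_NAME", "default-embedding-model"),
    )
```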
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_get_collection_id now catches the UndefinedTable error that occurs
before the first index operation creates the langchain tables.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consolidate onto PostgreSQL by using pgvector instead of a separate
ChromaDB instance. This removes a Docker volume and a large dependency,
and simplifies the stack without meaningful performance impact at
our document scale.
- Swap langchain-chroma for langchain-postgres (PGVector)
- Use pgvector/pgvector:pg16 Docker image with init script
- Lazy-initialize vector store to avoid eager DB connections
- Add SQL helpers for stats/delete/list (replacing _collection access)
- Remove legacy main.py, chunker, petmd scraper, and /api/query endpoint
Re-index required after deploy (POST /api/rag/index + /index-obsidian).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>