16 Commits

Author SHA1 Message Date
Ryan Chen add9946bc2 Improve Obsidian RAG retrieval for large vaults
- Markdown-aware chunking (split on headers before size-based splitting)
- Prepend note filename to each chunk for self-contained context
- Source-filtered retrieval (obsidian/paperless queries stay isolated)
- MMR search with k=8, fetch_k=24 for better recall and diversity
- Add source metadata to Paperless docs and folder metadata to Obsidian docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-04 13:34:15 -04:00
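A rough sketch of the header-first chunking described in add9946bc2, not the repository's code. The function name, the `max_chars` limit, and the `[filename]` prefix format are all assumptions made for illustration.

```python
import re


def split_markdown(text, filename, max_chars=1000):
    """Split on Markdown headers first, then by size, prefixing each chunk
    with the note filename so every chunk carries its own context."""
    # Zero-width split before any line starting with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Size-based fallback: cut oversized sections into fixed windows.
        for start in range(0, len(section), max_chars):
            piece = section[start:start + max_chars]
            chunks.append(f"[{filename}]\n{piece}")
    return chunks


chunks = split_markdown("intro\n# A\nbody\n## B\nmore", "note.md")
```

Splitting on headers first keeps each section's heading in the same chunk as its text, so the fixed-size fallback only applies to sections that are too long on their own.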
Ryan Chen 9a149cdaa6 Use in-memory cache for obsidian indexed files instead of cross-engine DB query
The async/sync engine split caused visibility issues where newly indexed
files weren't found on the next cycle, triggering re-indexing of all 36
files every 60 seconds. Replace with a module-level dict that loads from
the DB on cold start and stays in sync via cache updates after each
indexing pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-31 07:37:39 -04:00
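The caching pattern from 9a149cdaa6 might look roughly like the sketch below. Every name here is hypothetical, and `load_from_db` stands in for whatever query the project runs on cold start.

```python
from typing import Callable, Dict, Optional

# Module-level cache: note path -> mtime at last index. None = not loaded yet.
_indexed_files: Optional[Dict[str, float]] = None


def get_indexed_files(load_from_db: Callable[[], Dict[str, float]]) -> Dict[str, float]:
    """Return the cache, hitting the database only on the first (cold) call."""
    global _indexed_files
    if _indexed_files is None:
        _indexed_files = dict(load_from_db())
    return _indexed_files


def mark_indexed(path: str, mtime: float) -> None:
    """Record a freshly indexed file. If the cache is still cold, skip the
    update: the later cold-start load is expected to pick the row up."""
    if _indexed_files is not None:
        _indexed_files[path] = mtime


def mark_deleted(path: str) -> None:
    """Forget a file that was removed from the vault."""
    if _indexed_files is not None:
        _indexed_files.pop(path, None)
```

Because the dict lives in process memory, the watcher no longer depends on one database engine seeing the other's writes between cycles.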
Ryan Chen 00c9b44c0e Preserve wikilink text in Obsidian indexing and fix duplicate sync
Two fixes:
- Convert wikilinks to display text instead of stripping them entirely.
  [[Noah]] becomes "Noah", [[target|display]] becomes "display". Stripping
  them made names and references inside wikilinks invisible to search.
- Switch _get_obsidian_indexed_files to async engine to avoid stale reads
  from the separate sync engine, which caused files to be re-indexed
  every cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-31 07:35:24 -04:00
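One way to express the wikilink conversion from 00c9b44c0e is a single regex substitution. This is an illustrative sketch with a hypothetical function name; embeds such as `![[...]]` and heading anchors get no special handling.

```python
import re

# [[target]] or [[target|display]]; group 2 is the optional display text.
_WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def wikilinks_to_text(text):
    """Replace each wikilink with its display text, or its target if none."""
    return _WIKILINK.sub(lambda m: m.group(2) or m.group(1), text)


print(wikilinks_to_text("Met [[Noah]] at [[places/Cafe|the cafe]]."))
# prints: Met Noah at the cafe.
```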
Ryan Chen 9f51dc3cdb Fix NOT NULL violation on empty splits and increase search results to k=6
Documents left empty by sanitization caused aadd_documents to issue a
DEFAULT VALUES insert, which violated a NOT NULL constraint. Guard with
an emptiness check. Also increase similarity search k from 2 to 6 so
multi-word queries like full names have better recall.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-31 07:16:28 -04:00
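The emptiness guard from 9f51dc3cdb could be as small as the sketch below. It assumes `store` is any object with an async `aadd_documents` method, and it uses plain strings where the real code presumably passes LangChain documents. The helper name is hypothetical.

```python
from typing import List


async def add_non_empty(store, chunks: List[str]) -> int:
    """Drop empty or whitespace-only chunks, and skip the store call entirely
    when nothing is left, so no empty insert is ever issued."""
    kept = [c for c in chunks if c and c.strip()]
    if not kept:
        return 0
    await store.aadd_documents(kept)
    return len(kept)
```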
Ryan Chen 1e6bc536b4 Fix datetime serialization in Obsidian metadata for pgvector
YAML frontmatter can contain datetime objects, which aren't JSON
serializable. Add _make_serializable() to coerce all metadata values
before storing in pgvector.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-31 07:11:29 -04:00
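The commit does not show the body of `_make_serializable()`. A recursive coercion along these lines would cover the datetime case; the string fallback for unknown types is an assumption, not something the source confirms.

```python
import json
from datetime import date, datetime


def make_serializable(value):
    """Recursively coerce a metadata value into JSON-safe types."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, dict):
        return {str(k): make_serializable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [make_serializable(v) for v in value]
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    # Assumed fallback: stringify anything else rather than fail.
    return str(value)


meta = {"created": datetime(2026, 5, 31, 7, 11, 29), "tags": ["a", date(2026, 1, 2)]}
print(json.dumps(make_serializable(meta)))
# prints: {"created": "2026-05-31T07:11:29", "tags": ["a", "2026-01-02"]}
```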
Ryan Chen 869de1c250 Add incremental Obsidian-to-pgvector sync with background watcher
Replace full delete-and-reindex with mtime-based incremental sync that
only re-indexes changed/new files and removes deleted ones. A background
polling task keeps the vector store up-to-date automatically when
OBSIDIAN_CONTINUOUS_SYNC=true.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-31 07:05:48 -04:00
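The mtime comparison behind 869de1c250 can be sketched as a pure function over two path-to-mtime mappings. The function name is hypothetical, and treating any mtime difference as a change is an assumption; the real code may only accept newer mtimes.

```python
def plan_sync(on_disk, indexed):
    """Compare {path: mtime} on disk with {path: mtime} already indexed.

    Returns (to_index, to_remove): new or changed files to re-index, and
    indexed files that no longer exist on disk.
    """
    to_index = sorted(p for p, m in on_disk.items() if indexed.get(p) != m)
    to_remove = sorted(p for p in indexed if p not in on_disk)
    return to_index, to_remove


to_index, to_remove = plan_sync(
    {"a.md": 10.0, "b.md": 25.0, "c.md": 5.0},   # current vault
    {"a.md": 10.0, "b.md": 20.0, "old.md": 1.0},  # last indexed state
)
print(to_index, to_remove)
# prints: ['b.md', 'c.md'] ['old.md']
```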
Ryan Chen 9629bfcef4 Fix embedding tokenizer mismatch with custom embedding server
Disable tiktoken pre-encoding for custom embedding servers. LangChain
was encoding text into OpenAI token IDs and then sending them to
llama-server, which has a different vocabulary, causing "invalid tokens"
errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-11 23:42:23 -04:00
Ryan Chen b4097730ef Add per-chunk error logging and broaden text sanitizer
Indexes chunks one at a time with error logging to identify which
document/chunk causes embedding failures. Also strips Unicode surrogates
and replacement characters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-11 23:38:03 -04:00
Ryan Chen abb06b78e2 Sanitize document text before embedding to fix tokenizer errors
Strips null bytes, control characters, and excessive whitespace from
document content before sending to embedding models. Fixes 400 errors
from BERT-based tokenizers (e.g. nomic-embed-text) on PDF-extracted text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-11 23:35:25 -04:00
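A sanitizer in the spirit of abb06b78e2 might look like this. The exact character ranges and whitespace rules are assumptions; the commit only names null bytes, control characters, and excessive whitespace.

```python
import re

# ASCII control characters except tab, newline, and carriage return.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_SPACES = re.compile(r"[ \t]+")
_BLANK_LINES = re.compile(r"\n{3,}")


def sanitize(text):
    """Strip control characters and collapse runs of whitespace."""
    text = _CONTROL.sub("", text)
    text = _SPACES.sub(" ", text)          # runs of spaces/tabs -> one space
    text = _BLANK_LINES.sub("\n\n", text)  # 3+ newlines -> one blank line
    return text.strip()


print(repr(sanitize("a\x00b\x07  c\n\n\n\nd ")))
# prints: 'ab c\n\nd'
```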
Ryan Chen 92171cbfb6 Support custom OpenAI-compatible embedding server with OpenAI fallback
Adds EMBEDDING_SERVER_URL and EMBEDDING_MODEL_NAME env vars, mirroring
the existing LLAMA_SERVER_URL pattern for LLM configuration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-11 23:24:54 -04:00
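The fallback logic from 92171cbfb6 could be sketched as follows. Only the two environment variable names come from the commit. The function name, the returned dict shape, and the default OpenAI model name are assumptions for illustration.

```python
import os


def embedding_config(env=None):
    """Pick the custom embedding server when configured, else fall back to OpenAI."""
    env = os.environ if env is None else env
    url = env.get("EMBEDDING_SERVER_URL")
    if url:
        return {
            "provider": "custom",
            "base_url": url,
            "model": env.get("EMBEDDING_MODEL_NAME"),
        }
    # Assumed default; the repository's actual fallback model is not shown.
    return {"provider": "openai", "base_url": None, "model": "text-embedding-3-small"}
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.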
Ryan Chen 564a9b68a5 Enable async_mode on PGVector for async method support
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-24 08:53:21 -04:00
Ryan Chen c157c37cde Handle missing pgvector tables on first run
_get_collection_id now catches the UndefinedTable error that occurs
before the first index operation creates the langchain tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-24 08:49:00 -04:00
Ryan Chen 438399646f Replace ChromaDB with pgvector for vector storage
Consolidate onto PostgreSQL by using pgvector instead of a separate
ChromaDB instance. This removes a Docker volume and a large dependency,
and simplifies the stack without meaningful performance impact at
our document scale.

- Swap langchain-chroma for langchain-postgres (PGVector)
- Use pgvector/pgvector:pg16 Docker image with init script
- Lazy-initialize vector store to avoid eager DB connections
- Add SQL helpers for stats/delete/list (replacing _collection access)
- Remove legacy main.py, chunker, petmd scraper, and /api/query endpoint

Re-index required after deploy (POST /api/rag/index + /index-obsidian).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-24 08:43:52 -04:00
ryan 86cc269b3a yeet 2026-03-03 08:23:31 -05:00
Ryan Chen 6ae36b51a0 ynab update 2026-01-31 22:47:43 -05:00
Ryan Chen ad39904dda reorganization 2026-01-31 17:13:27 -05:00