simbarag

Author	SHA1	Message	Date
Ryan Chen	add9946bc2	Improve Obsidian RAG retrieval for large vaults - Markdown-aware chunking (split on headers before size-based splitting) - Prepend note filename to each chunk for self-contained context - Source-filtered retrieval (obsidian/paperless queries stay isolated) - MMR search with k=8, fetch_k=24 for better recall and diversity - Add source metadata to Paperless docs and folder metadata to Obsidian docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-04 13:34:15 -04:00
Ryan Chen	9a149cdaa6	Use in-memory cache for obsidian indexed files instead of cross-engine DB query The async/sync engine split caused visibility issues where newly indexed files weren't found on the next cycle, triggering re-indexing of all 36 files every 60 seconds. Replace with a module-level dict that loads from DB on cold start and stays in sync via cache updates after each indexing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-31 07:37:39 -04:00
Ryan Chen	00c9b44c0e	Preserve wikilink text in Obsidian indexing and fix duplicate sync Two fixes: - Convert wikilinks to display text instead of stripping them entirely. [[Noah]] becomes "Noah", [[target|display]] becomes "display". This was causing names and references in wikilinks to be invisible to search. - Switch _get_obsidian_indexed_files to async engine to avoid stale reads from the separate sync engine, which caused files to be re-indexed every cycle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-31 07:35:24 -04:00
Ryan Chen	9f51dc3cdb	Fix NOT NULL violation on empty splits and increase search results to k=6 Empty documents after sanitization caused aadd_documents to issue a DEFAULT VALUES insert. Guard with an emptiness check. Also increase similarity search k from 2 to 6 so multi-word queries like full names have better recall. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-31 07:16:28 -04:00
Ryan Chen	1e6bc536b4	Fix datetime serialization in Obsidian metadata for pgvector YAML frontmatter can contain datetime objects which aren't JSON serializable. Add _make_serializable() to coerce all metadata values before storing in pgvector. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-31 07:11:29 -04:00
Ryan Chen	869de1c250	Add incremental Obsidian-to-pgvector sync with background watcher Replace full delete-and-reindex with mtime-based incremental sync that only re-indexes changed/new files and removes deleted ones. A background polling task keeps the vector store up-to-date automatically when OBSIDIAN_CONTINUOUS_SYNC=true. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-31 07:05:48 -04:00
Ryan Chen	9629bfcef4	Fix embedding tokenizer mismatch with custom embedding server Disable tiktoken pre-encoding for custom embedding servers. LangChain was encoding text into OpenAI token IDs then sending them to llama-server which has a different vocabulary, causing "invalid tokens" errors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-11 23:42:23 -04:00
Ryan Chen	b4097730ef	Add per-chunk error logging and broaden text sanitizer Indexes chunks one at a time with error logging to identify which document/chunk causes embedding failures. Also strips Unicode surrogates and replacement characters. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-11 23:38:03 -04:00
Ryan Chen	abb06b78e2	Sanitize document text before embedding to fix tokenizer errors Strips null bytes, control characters, and excessive whitespace from document content before sending to embedding models. Fixes 400 errors from BERT-based tokenizers (e.g. nomic-embed-text) on PDF-extracted text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-11 23:35:25 -04:00
Ryan Chen	92171cbfb6	Support custom OpenAI-compatible embedding server with OpenAI fallback Adds EMBEDDING_SERVER_URL and EMBEDDING_MODEL_NAME env vars, mirroring the existing LLAMA_SERVER_URL pattern for LLM configuration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-11 23:24:54 -04:00
Ryan Chen	564a9b68a5	Enable async_mode on PGVector for async method support Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-24 08:53:21 -04:00
Ryan Chen	c157c37cde	Handle missing pgvector tables on first run _get_collection_id now catches the UndefinedTable error that occurs before the first index operation creates the langchain tables. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-24 08:49:00 -04:00
Ryan Chen	438399646f	Replace ChromaDB with pgvector for vector storage Consolidate onto PostgreSQL by using pgvector instead of a separate ChromaDB instance. This removes a Docker volume, a large dependency, and simplifies the stack without meaningful performance impact at our document scale. - Swap langchain-chroma for langchain-postgres (PGVector) - Use pgvector/pgvector:pg16 Docker image with init script - Lazy-initialize vector store to avoid eager DB connections - Add SQL helpers for stats/delete/list (replacing _collection access) - Remove legacy main.py, chunker, petmd scraper, and /api/query endpoint Re-index required after deploy (POST /api/rag/index + /index-obsidian). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-24 08:43:52 -04:00
ryan	86cc269b3a	yeet	2026-03-03 08:23:31 -05:00
Ryan Chen	6ae36b51a0	ynab update	2026-01-31 22:47:43 -05:00
Ryan Chen	ad39904dda	reorganization	2026-01-31 17:13:27 -05:00

16 Commits
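
Note on 00c9b44c0e: the wikilink handling described in that commit (a bare link keeps its target text, an aliased link keeps its display text) can be sketched roughly as below. This is an illustrative sketch only, not the repository's code: the function name `wikilinks_to_text` and the regex are assumptions, and embeds such as `![[image.png]]` are not treated specially.

```python
import re

# Hypothetical pattern (not taken from the repository). It matches:
#   [[target]]            -> group 1 = target, group 2 = None
#   [[target|display]]    -> group 1 = target, group 2 = display
#   [[target\|display]]   -> same, with the pipe escaped as inside a Markdown table
_WIKILINK = re.compile(r"\[\[([^\[\]|\\]+)(?:\\?\|([^\[\]]+))?\]\]")


def wikilinks_to_text(text: str) -> str:
    """Replace each wikilink with its display text, or its target if no alias is given."""

    def _replace(match: re.Match) -> str:
        target, display = match.group(1), match.group(2)
        return (display or target).strip()

    return _WIKILINK.sub(_replace, text)


if __name__ == "__main__":
    print(wikilinks_to_text("Met [[Noah]] at the [[Vet Clinic|clinic]] today"))
```

Keeping the visible text, rather than deleting the whole link, is what lets names that appear only inside links remain searchable after chunking and embedding.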