simbarag

Author	SHA1	Message	Date
Ryan Chen	9f51dc3cdb	Fix NOT NULL violation on empty splits and increase search results to k=6. Empty documents after sanitization caused aadd_documents to issue a DEFAULT VALUES insert. Guard with an emptiness check. Also increase similarity search k from 2 to 6 so multi-word queries like full names have better recall. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-31 07:16:28 -04:00
Ryan Chen	1e6bc536b4	Fix datetime serialization in Obsidian metadata for pgvector. YAML frontmatter can contain datetime objects which aren't JSON serializable. Add _make_serializable() to coerce all metadata values before storing in pgvector. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-31 07:11:29 -04:00
Ryan Chen	869de1c250	Add incremental Obsidian-to-pgvector sync with background watcher. Replace full delete-and-reindex with mtime-based incremental sync that only re-indexes changed/new files and removes deleted ones. A background polling task keeps the vector store up-to-date automatically when OBSIDIAN_CONTINUOUS_SYNC=true. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-31 07:05:48 -04:00
Ryan Chen	9629bfcef4	Fix embedding tokenizer mismatch with custom embedding server. Disable tiktoken pre-encoding for custom embedding servers. LangChain was encoding text into OpenAI token IDs then sending them to llama-server which has a different vocabulary, causing "invalid tokens" errors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-11 23:42:23 -04:00
Ryan Chen	b4097730ef	Add per-chunk error logging and broaden text sanitizer. Indexes chunks one at a time with error logging to identify which document/chunk causes embedding failures. Also strips Unicode surrogates and replacement characters. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-11 23:38:03 -04:00
Ryan Chen	abb06b78e2	Sanitize document text before embedding to fix tokenizer errors. Strips null bytes, control characters, and excessive whitespace from document content before sending to embedding models. Fixes 400 errors from BERT-based tokenizers (e.g. nomic-embed-text) on PDF-extracted text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-11 23:35:25 -04:00
Ryan Chen	92171cbfb6	Support custom OpenAI-compatible embedding server with OpenAI fallback. Adds EMBEDDING_SERVER_URL and EMBEDDING_MODEL_NAME env vars, mirroring the existing LLAMA_SERVER_URL pattern for LLM configuration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-11 23:24:54 -04:00
Ryan Chen	564a9b68a5	Enable async_mode on PGVector for async method support. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-24 08:53:21 -04:00
Ryan Chen	c157c37cde	Handle missing pgvector tables on first run. _get_collection_id now catches the UndefinedTable error that occurs before the first index operation creates the langchain tables. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-24 08:49:00 -04:00
Ryan Chen	438399646f	Replace ChromaDB with pgvector for vector storage. Consolidate onto PostgreSQL by using pgvector instead of a separate ChromaDB instance. This removes a Docker volume, a large dependency, and simplifies the stack without meaningful performance impact at our document scale. Swap langchain-chroma for langchain-postgres (PGVector); use pgvector/pgvector:pg16 Docker image with init script; lazy-initialize vector store to avoid eager DB connections; add SQL helpers for stats/delete/list (replacing _collection access); remove legacy main.py, chunker, petmd scraper, and /api/query endpoint. Re-index required after deploy (POST /api/rag/index + /index-obsidian). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-24 08:43:52 -04:00
ryan	86cc269b3a	yeet	2026-03-03 08:23:31 -05:00
Ryan Chen	6ae36b51a0	ynab update	2026-01-31 22:47:43 -05:00
Ryan Chen	ad39904dda	reorganization	2026-01-31 17:13:27 -05:00

13 Commits
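
Commit 869de1c250 describes an mtime-based incremental sync that re-indexes only changed or new files and removes deleted ones. The repository's actual code is not shown in this log, so the following is only a minimal sketch of that idea. The names `SyncPlan` and `plan_sync`, and the choice to represent state as path-to-mtime dictionaries, are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SyncPlan:
    # Hypothetical container, not taken from the repository.
    to_index: list[str] = field(default_factory=list)   # new or modified files
    to_delete: list[str] = field(default_factory=list)  # files no longer on disk


def plan_sync(current: dict[str, float], indexed: dict[str, float]) -> SyncPlan:
    """Compare on-disk mtimes with the mtimes recorded at the last index run."""
    plan = SyncPlan()
    for path, mtime in sorted(current.items()):
        # Index files never seen before, or whose mtime has moved forward.
        if path not in indexed or mtime > indexed[path]:
            plan.to_index.append(path)
    # Anything indexed earlier but missing from disk should be removed.
    plan.to_delete = sorted(set(indexed) - set(current))
    return plan


# Example state: b.md was edited, d.md is new, c.md was deleted.
current = {"a.md": 100.0, "b.md": 250.0, "d.md": 50.0}
indexed = {"a.md": 100.0, "b.md": 200.0, "c.md": 10.0}
plan = plan_sync(current, indexed)
```

In a real watcher, `current` would presumably be built from `os.stat(path).st_mtime` on each poll, and `indexed` from metadata stored alongside the vectors. Both of those details are guesses about the design rather than facts from the commit.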
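
Commit 1e6bc536b4 mentions coercing metadata values so that YAML frontmatter containing datetime objects can be stored as JSON. The body of `_make_serializable()` is not shown in this log, so the sketch below uses a hypothetical stand-in named `to_json_safe`. Its fallback of calling `str()` on unknown types is an assumption, not the repository's confirmed behavior.

```python
import json
from datetime import date, datetime


def to_json_safe(value):
    """Recursively coerce a metadata value into JSON-serializable types."""
    # datetime is a subclass of date, so one check covers both.
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, dict):
        return {str(k): to_json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(v) for v in value]
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    # Assumed fallback: stringify anything else rather than fail.
    return str(value)


# Example frontmatter as a YAML parser might return it.
frontmatter = {
    "title": "Vet visit",
    "created": datetime(2026, 5, 31, 7, 11, 29),
    "due": date(2026, 6, 1),
    "tags": ["cat", "health"],
}
safe = to_json_safe(frontmatter)
encoded = json.dumps(safe)  # no longer raises TypeError
```

Using ISO 8601 strings keeps the values sortable and easy to parse back, though any consistent string format would satisfy the JSON constraint.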