simbarag/blueprints
ryan abb06b78e2 Sanitize document text before embedding to fix tokenizer errors
Strips null bytes, control characters, and excessive whitespace from
document content before sending to embedding models. Fixes 400 errors
from BERT-based tokenizers (e.g. nomic-embed-text) on PDF-extracted text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-11 23:35:25 -04:00
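
The kind of cleanup this commit message describes could look roughly like the sketch below. It is not the code from the commit: the function name `sanitize_text` and the regex constants are hypothetical, and the exact rules (which control characters are dropped, how whitespace is collapsed) are assumptions made for illustration.

```python
import re

# Assumption: drop C0 control characters (including the null byte) and DEL,
# but keep tab, newline, and carriage return so line structure survives.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
# Runs of spaces and tabs collapse to a single space.
_HORIZONTAL_WS = re.compile(r"[ \t]+")
# A single optional space on either side of a newline is removed.
_SPACE_AROUND_NEWLINE = re.compile(r" ?\n ?")
# Three or more consecutive newlines collapse to one blank line.
_BLANK_LINES = re.compile(r"\n{3,}")


def sanitize_text(text: str) -> str:
    """Return text with control characters and excess whitespace removed.

    Hypothetical helper, meant to run on extracted document text
    before it is sent to an embedding model.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Remove null bytes and other control characters.
    text = _CONTROL_CHARS.sub("", text)
    # Collapse horizontal whitespace, then trim it around line breaks.
    text = _HORIZONTAL_WS.sub(" ", text)
    text = _SPACE_AROUND_NEWLINE.sub("\n", text)
    # Limit consecutive blank lines, then trim both ends.
    text = _BLANK_LINES.sub("\n\n", text)
    return text.strip()
```

A helper like this would typically be applied to each chunk of PDF-extracted text just before the embedding call, so that the stored document content itself stays untouched.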