Switch to OpenAI embeddings for ChromaDB

Replace Ollama embedding function with OpenAI's text-embedding-3-small model for improved embedding quality and consistency.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Configure Docker for Linux host networking and add startup reindex
2025-10-02 21:05:17 -04:00 · 2025-10-02 21:02:55 -04:00 · 2025-10-02 20:57:19 -04:00 · 2025-10-02 20:48:52 -04:00 · 2025-10-02 20:46:10 -04:00 · 2025-10-02 20:44:49 -04:00
25 changed files with 4159 additions and 79 deletions
--- a/.dockerignore
+++ b/.dockerignore
@@ -0,0 +1,16 @@
 .git
 .gitignore
 README.md
 .env
 .DS_Store
 chromadb/
 chroma_db/
 raggr-frontend/node_modules/
 __pycache__/
 *.pyc
 *.pyo
 *.pyd
 .Python
 .venv/
 venv/
 .pytest_cache/
--- a/.python-version
+++ b/.python-version
@@ -0,0 +1 @@
 3.13
--- a/46
+++ b/46
@@ -0,0 +1,46 @@
 FROM python:3.13-slim
 WORKDIR /app
 # Install system dependencies, Node.js, Yarn, and uv
 RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    && curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \
    && apt-get install -y nodejs \
    && npm install -g yarn \
    && rm -rf /var/lib/apt/lists/* \
    && curl -LsSf https://astral.sh/uv/install.sh | sh
 # Add uv to PATH
 ENV PATH="/root/.local/bin:$PATH"
 # Copy dependency files
 COPY pyproject.toml ./
 # Install Python dependencies using uv
 RUN uv pip install --system -e .
 # Copy application code
 COPY *.py ./
 COPY startup.sh ./
 RUN chmod +x startup.sh
 # Copy frontend code and build
 COPY raggr-frontend ./raggr-frontend
 WORKDIR /app/raggr-frontend
 RUN yarn install && yarn build
 WORKDIR /app
 # Create ChromaDB directory
 RUN mkdir -p /app/chromadb
 # Expose port
 EXPOSE 8080
 # Set environment variables
 ENV PYTHONPATH=/app
 ENV CHROMADB_PATH=/app/chromadb
 # Run the startup script
 CMD ["./startup.sh"]
--- a/app.py
+++ b/app.py
@@ -0,0 +1,37 @@
 import os
 from flask import Flask, request, jsonify, render_template, send_from_directory
 from main import consult_simba_oracle
 app = Flask(__name__, static_folder="raggr-frontend/dist/static", template_folder="raggr-frontend/dist")
 # Serve React static files
@app.route('/static/<path:filename>')
 def static_files(filename):
    return send_from_directory(app.static_folder, filename)
 # Serve the React app for all routes (catch-all)
@app.route('/', defaults={'path': ''})
@app.route('/<path:path>')
 def serve_react_app(path):
    if path and os.path.exists(os.path.join(app.template_folder, path)):
        return send_from_directory(app.template_folder, path)
    return render_template('index.html')
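@app.route("/api/query", methods=["POST"])
 def query():
    # Tolerate a missing or non-JSON body instead of failing on data.get
    data = request.get_json(silent=True) or {}
    query = data.get("query")
    if not query:
        return jsonify({"error": "missing 'query' field"}), 400
    return jsonify({"response": consult_simba_oracle(query)})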
@app.route("/api/ingest", methods=["POST"])
 def webhook():
    data = request.get_json()
    print(data)
    return jsonify({"status": "received"})
 if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080, debug=True)
--- a/chunker.py
+++ b/chunker.py
@@ -0,0 +1,134 @@
 import os
 from math import ceil
 import re
 from typing import Union
 from uuid import UUID, uuid4
 from chromadb.utils.embedding_functions.openai_embedding_function import (
    OpenAIEmbeddingFunction,
 )
 from dotenv import load_dotenv
 load_dotenv()
 def remove_headers_footers(text, header_patterns=None, footer_patterns=None):
    if header_patterns is None:
        header_patterns = [r"^.*Header.*$"]
    if footer_patterns is None:
        footer_patterns = [r"^.*Footer.*$"]
    for pattern in header_patterns + footer_patterns:
        text = re.sub(pattern, "", text, flags=re.MULTILINE)
    return text.strip()
 def remove_special_characters(text, special_chars=None):
    if special_chars is None:
        special_chars = r"[^A-Za-z0-9\s\.,;:\'\"\?\!\-]"
    text = re.sub(special_chars, "", text)
    return text.strip()
 def remove_repeated_substrings(text, pattern=r"\.{2,}"):
    text = re.sub(pattern, ".", text)
    return text.strip()
 def remove_extra_spaces(text):
    text = re.sub(r"\n\s*\n", "\n\n", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()
 def preprocess_text(text):
    # Remove headers and footers
    text = remove_headers_footers(text)
    # Remove special characters
    text = remove_special_characters(text)
    # Remove repeated substrings like dots
    text = remove_repeated_substrings(text)
    # Remove extra spaces between lines and within lines
    text = remove_extra_spaces(text)
    # Additional cleaning steps can be added here
    return text.strip()
 class Chunk:
    def __init__(
        self,
        text: str,
        size: int,
        document_id: UUID,
        chunk_id: int,
        embedding,
    ):
        self.text = text
        self.size = size
        self.document_id = document_id
        self.chunk_id = chunk_id
        self.embedding = embedding
 class Chunker:
    embedding_fx = OpenAIEmbeddingFunction(
        api_key=os.getenv("OPENAI_API_KEY"),
        model_name="text-embedding-3-small",
    )
    def __init__(self, collection) -> None:
        self.collection = collection
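    def chunk_document(
        self,
        document: str,
        chunk_size: int = 1000,
        metadata: dict[str, Union[str, float]] | None = None,
    ) -> list[Chunk]:
        # Avoid a shared mutable default argument
        metadata = metadata or {}
        doc_uuid = uuid4()
        chunk_size = min(chunk_size, len(document)) or 1
        chunks = []
        num_chunks = ceil(len(document) / chunk_size)
        document_length = len(document)
        for i in range(num_chunks):
            # Stride by chunk_size (not num_chunks) so the chunks tile the document
            curr_pos = i * chunk_size
            to_pos = min(curr_pos + chunk_size, document_length)
            text_chunk = self.clean_document(document[curr_pos:to_pos])
            embedding = self.embedding_fx([text_chunk])
            self.collection.add(
                ids=[str(doc_uuid) + ":" + str(i)],
                documents=[text_chunk],
                embeddings=embedding,
                # Pass None rather than an empty dict, which ChromaDB may reject
                metadatas=[metadata] if metadata else None,
            )
            # Record the chunk so the declared return value is populated
            chunks.append(
                Chunk(
                    text=text_chunk,
                    size=len(text_chunk),
                    document_id=doc_uuid,
                    chunk_id=i,
                    embedding=embedding[0],
                )
            )
        return chunks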
    def clean_document(self, document: str) -> str:
        """Remove information that is noise or already known.
        Example: everything indexed here is already known to be Simba-related, so phrases like
        "Summary of Simba's visit" add no signal.
        """
        document = document.replace("\\n", "")
        document = document.strip()
        return preprocess_text(document)
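The offset arithmetic that `chunk_document` is meant to perform can be checked on its own, without the embedding call or a ChromaDB collection. The sketch below is only an illustration: `chunk_windows` is a hypothetical helper that is not part of this commit. It assumes fixed-size, non-overlapping chunks, which means each chunk starts `chunk_size` characters after the previous one.

```python
from math import ceil


def chunk_windows(document_length: int, chunk_size: int = 1000) -> list[tuple[int, int]]:
    """Return (start, end) offsets that tile a document of the given length."""
    # Same guard as chunk_document: never exceed the document, never use 0
    chunk_size = min(chunk_size, document_length) or 1
    num_chunks = ceil(document_length / chunk_size)
    windows = []
    for i in range(num_chunks):
        start = i * chunk_size  # stride by chunk_size so windows do not overlap
        end = min(start + chunk_size, document_length)
        windows.append((start, end))
    return windows


print(chunk_windows(2500))  # [(0, 1000), (1000, 2000), (2000, 2500)]
```

Every character falls in exactly one window, and an empty document yields no windows at all.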
--- a/cleaner.py
+++ b/cleaner.py
@@ -0,0 +1,165 @@
 import os
 import sys
 import tempfile
 import argparse
 from dotenv import load_dotenv
 import ollama
 from PIL import Image
 import fitz
 from request import PaperlessNGXService
 load_dotenv()
 # Configure ollama client with URL from environment or default to localhost
 ollama_client = ollama.Client(host=os.getenv("OLLAMA_URL", "http://localhost:11434"))
 parser = argparse.ArgumentParser(description="use llm to clean documents")
 parser.add_argument("document_id", type=str, help="ID of the Paperless-NGX document to clean")
 def pdf_to_image(filepath: str, dpi=300) -> list[str]:
    """Returns the filepaths to the created images"""
    image_temp_files = []
    try:
        pdf_document = fitz.open(filepath)
        print(f"\nConverting '{os.path.basename(filepath)}' to temporary images...")
        for page_num in range(len(pdf_document)):
            page = pdf_document.load_page(page_num)
            zoom = dpi / 72
            mat = fitz.Matrix(zoom, zoom)
            pix = page.get_pixmap(matrix=mat)
            # Create a temporary file for the image. delete=False is crucial.
            with tempfile.NamedTemporaryFile(
                delete=False,
                suffix=".png",
                prefix=f"pdf_page_{page_num + 1}_",
            ) as temp_image_file:
                temp_image_path = temp_image_file.name
            # Save the pixel data to the temporary file
            pix.save(temp_image_path)
            image_temp_files.append(temp_image_path)
            print(
                f"  -> Saved page {page_num + 1} to temporary file: '{temp_image_path}'"
            )
        print("\nConversion successful! ✨")
        return image_temp_files
    except Exception as e:
        print(f"An error occurred during PDF conversion: {e}", file=sys.stderr)
        # Clean up any image files that were created before the error
        for path in image_temp_files:
            os.remove(path)
        return []
 def merge_images_vertically_to_tempfile(image_paths):
    """
    Merges a list of images vertically and saves the result to a temporary file.
    Args:
        image_paths (list): A list of strings, where each string is the
                            filepath to an image.
    Returns:
        str: The filepath of the temporary merged image file.
    """
    if not image_paths:
        print("Error: The list of image paths is empty.")
        return None
    # Open all images and check for consistency
    try:
        images = [Image.open(path) for path in image_paths]
    except FileNotFoundError as e:
        print(f"Error: Could not find image file: {e}")
        return None
    widths, heights = zip(*(img.size for img in images))
    max_width = max(widths)
    # All images must have the same width
    if not all(width == max_width for width in widths):
        print("Warning: Images have different widths. They will be resized.")
        resized_images = []
        for img in images:
            if img.size[0] != max_width:
                img = img.resize(
                    (max_width, int(img.size[1] * (max_width / img.size[0])))
                )
            resized_images.append(img)
        images = resized_images
        heights = [img.size[1] for img in images]
    # Calculate the total height of the merged image
    total_height = sum(heights)
    # Create a new blank image with the combined dimensions
    merged_image = Image.new("RGB", (max_width, total_height))
    # Paste each image onto the new blank image
    y_offset = 0
    for img in images:
        merged_image.paste(img, (0, y_offset))
        y_offset += img.height
    # Create a temporary file and save the image
    temp_file = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
    temp_path = temp_file.name
    merged_image.save(temp_path)
    temp_file.close()
    print(f"Successfully merged {len(images)} images into temporary file: {temp_path}")
    return temp_path
 OCR_PROMPT = """
    Your job is to extract text from the images I provide you. Extract every bit of the text in the image. Don't say anything, just do your job. The text should be the same as in the images. If there are multiple images, categorize the transcriptions by page.
 Things to avoid:
 - Don't miss anything to extract from the images
 Things to include:
 - Include everything, even anything inside [], (), {} or anything.
 - Include any repetitive things like "..." or anything
 - If you think there is any mistake in image just include it too
 Someone will kill the innocent kittens if you don't extract the text exactly. So, make sure you extract every bit of the text. Only output the extracted text.
 """
 def summarize_pdf_image(filepaths: list[str]):
    res = ollama_client.chat(
        model="gemma3:4b",
        messages=[
            {
                "role": "user",
                "content": OCR_PROMPT,
                "images": filepaths,
            }
        ],
    )
    return res["message"]["content"]
 if __name__ == "__main__":
    args = parser.parse_args()
    ppngx = PaperlessNGXService()
    if args.document_id:
        doc_id = args.document_id
        file = ppngx.get_doc_by_id(doc_id=doc_id)
        pdf_path = ppngx.download_pdf_from_id(doc_id)
        print(pdf_path)
        image_paths = pdf_to_image(filepath=pdf_path)
        summary = summarize_pdf_image(filepaths=image_paths)
        print(summary)
        file["content"] = summary
        print(file)
        ppngx.upload_cleaned_content(doc_id, file)
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -0,0 +1,17 @@
 version: '3.8'
 services:
  raggr:
    image: torrtle/simbarag:latest
    network_mode: host
    environment:
      - PAPERLESS_TOKEN=${PAPERLESS_TOKEN}
      - BASE_URL=${BASE_URL}
      - OLLAMA_URL=${OLLAMA_URL:-http://localhost:11434}
      - CHROMADB_PATH=/app/chromadb
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    volumes:
      - chromadb_data:/app/chromadb
 volumes:
  chromadb_data:
--- a/main.py
+++ b/main.py
@@ -1,102 +1,206 @@
-import ollama
+import datetime
 import logging
 import os
-from uuid import uuid4, UUID
+from typing import Any, Union
 import argparse
 import chromadb
 import ollama
 from openai import OpenAI
 from request import PaperlessNGXService
-
+from chunker import Chunker
-from math import ceil
+from query import QueryGenerator
-
+from cleaner import pdf_to_image, summarize_pdf_image
 import chromadb
 from chromadb.utils.embedding_functions.ollama_embedding_function import (
    OllamaEmbeddingFunction,
 )
 from dotenv import load_dotenv
 client = chromadb.EphemeralClient()
 collection = client.create_collection(name="docs")
 load_dotenv()
 # Configure ollama client with URL from environment or default to localhost
 ollama_client = ollama.Client(host=os.getenv("OLLAMA_URL", "http://localhost:11434"))
-class Chunk:
-    def __init__(
-        self,
-        text: str,
-        size: int,
-        document_id: UUID,
-        chunk_id: int,
-        embedding,
-    ):
-        self.text = text
-        self.size = size
-        self.document_id = document_id
-        self.chunk_id = chunk_id
-        self.embedding = embedding
-
-class Chunker:
-    def __init__(self) -> None:
-        self.embedding_fx = OllamaEmbeddingFunction(
-            url=os.getenv("OLLAMA_URL", ""),
-            model_name="mxbai-embed-large",
-        )
-        pass
-
-    def chunk_document(self, document: str, chunk_size: int = 300) -> list[Chunk]:
-        doc_uuid = uuid4()
-        chunks = []
-        num_chunks = ceil(len(document) / chunk_size)
-        document_length = len(document)
-        for i in range(num_chunks):
-            curr_pos = i * num_chunks
-            to_pos = (
-                curr_pos + num_chunks
-                if curr_pos + num_chunks < document_length
-                else document_length
-            )
-            text_chunk = document[curr_pos:to_pos]
-
-            embedding = self.embedding_fx([text_chunk])
-            collection.add(
-                ids=[str(doc_uuid) + ":" + str(i)],
-                documents=[text_chunk],
-                embeddings=embedding,
-            )
-        return chunks
+client = chromadb.PersistentClient(path=os.getenv("CHROMADB_PATH", ""))
+simba_docs = client.get_or_create_collection(name="simba_docs")
+feline_vet_lookup = client.get_or_create_collection(name="feline_vet_lookup")
+parser = argparse.ArgumentParser(
+    description="An LLM tool to query information about Simba <3"
+)
+parser.add_argument("query", type=str, help="questions about simba's health")
+parser.add_argument(
+    "--reindex", action="store_true", help="re-index the simba documents"
+)
+parser.add_argument(
+        "--index", help="index a file"
+)
+ppngx = PaperlessNGXService()
 openai_client = OpenAI()
 def index_using_pdf_llm():
    files = ppngx.get_data()
    for file in files:
        document_id = file["id"]
        pdf_path = ppngx.download_pdf_from_id(id=document_id)
        image_paths = pdf_to_image(filepath=pdf_path)
        print(f"summarizing {file}")
        generated_summary = summarize_pdf_image(filepaths=image_paths)
        file["content"] = generated_summary
    chunk_data(files, simba_docs)
-embedding_fx = OllamaEmbeddingFunction(
-    url=os.getenv("OLLAMA_URL", ""),
-    model_name="mxbai-embed-large",
-)
+def date_to_epoch(date_str: str) -> float:
+    split_date = date_str.split("-")
+    print(split_date)
    date = datetime.datetime(
        int(split_date[0]),
        int(split_date[1]),
        int(split_date[2]),
        0,
        0,
        0,
    )
    return date.timestamp()
 def chunk_data(docs: list[dict[str, Union[str, Any]]], collection):
    # Step 2: Create chunks
    chunker = Chunker(collection)
    print(f"chunking {len(docs)} documents")
    print(docs)
    texts: list[str] = [doc["content"] for doc in docs]
    for index, text in enumerate(texts):
        print(docs[index]["original_file_name"])
        metadata = {
             "created_date": date_to_epoch(docs[index]["created_date"]),
             "filename": docs[index]["original_file_name"]
        }
        chunker.chunk_document(
            document=text,
            metadata=metadata,
        )
 def chunk_text(texts: list[str], collection):
    chunker = Chunker(collection)
    for index, text in enumerate(texts):
        metadata = {}
        chunker.chunk_document(
            document=text,
            metadata=metadata,
        )
 def consult_oracle(input: str, collection):
    print(input)
    import time
    start_time = time.time()
    # Ask
    # print("Starting query generation")
    # qg_start = time.time()
    # qg = QueryGenerator()
    # metadata_filter = qg.get_query(input)
    # qg_end = time.time()
    # print(f"Query generation took {qg_end - qg_start:.2f} seconds")
    # print(metadata_filter)
    print("Starting embedding generation")
    embedding_start = time.time()
    embeddings = Chunker.embedding_fx(input=[input])
    embedding_end = time.time()
    print(f"Embedding generation took {embedding_end - embedding_start:.2f} seconds")
    print("Starting collection query")
    query_start = time.time()
    results = collection.query(
        query_texts=[input],
        query_embeddings=embeddings,
        #where=metadata_filter,
    )
    print(results)
    query_end = time.time()
    print(f"Collection query took {query_end - query_start:.2f} seconds")
    # Generate
    print("Starting LLM generation")
    llm_start = time.time()
    # output = ollama_client.generate(
        # model="gemma3n:e4b",
        # prompt=f"You are a helpful assistant that understandings veterinary terms. Using the following data, help answer the user's query by providing as many details as possible.  Using this data: {results}. Respond to this prompt: {input}",
    # )
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that understands veterinary terms."},
            {"role": "user", "content": f"Using the following data, help answer the user's query by providing as many details as possible. Using this data: {results}. Respond to this prompt: {input}"}
        ]
    )
    llm_end = time.time()
    print(f"LLM generation took {llm_end - llm_start:.2f} seconds")
    total_time = time.time() - start_time
    print(f"Total consult_oracle execution took {total_time:.2f} seconds")
    return response.choices[0].message.content
+def paperless_workflow(input):
    # Step 1: Get the text
    ppngx = PaperlessNGXService()
    docs = ppngx.get_data()
-texts = [doc["content"] for doc in docs]
-# Step 2: Create chunks
-chunker = Chunker()
-print(f"chunking {len(texts)} documents")
-for text in texts:
-    chunker.chunk_document(document=text)
+    chunk_data(docs, collection=simba_docs)
+    consult_oracle(input, simba_docs)
-# Ask
-input = "How many teeth has Simba had removed? Who is his current vet?"
-embeddings = embedding_fx(input=[input])
-results = collection.query(query_texts=[input], query_embeddings=embeddings)
-print(results)
-# Generate
-output = ollama.generate(
-    model="gemma3n:e4b",
-    prompt=f"Using this data: {results}. Respond to this prompt: {input}",
-)
-print(output["response"])
+def consult_simba_oracle(input: str):
+    return consult_oracle(
+        input=input,
+        collection=simba_docs,
+    )
+
 if __name__ == "__main__":
    args = parser.parse_args()
    if args.reindex:
        print("Fetching documents from Paperless-NGX")
        ppngx = PaperlessNGXService()
        docs = ppngx.get_data()
        print(docs)
        print(f"Fetched {len(docs)} documents")
        #
        print("Chunking documents now ...")
        chunk_data(docs, collection=simba_docs)
        print("Done chunking documents")
        # index_using_pdf_llm()
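    if args.index:
        # splitext keeps the leading dot, so compare against ".pdf", ".md", ".txt"
        extension = os.path.splitext(args.index)[1].lower()
        if extension == ".pdf":
            # The file is already local, so convert it directly instead of downloading by ID
            image_paths = pdf_to_image(filepath=args.index)
            print(f"summarizing {args.index}")
            generated_summary = summarize_pdf_image(filepaths=image_paths)
            # Assumption: the extracted text should be indexed like any other text file
            chunk_text(texts=[generated_summary], collection=simba_docs)
        elif extension in [".md", ".txt"]:
            with open(args.index) as file:
                chunk_text(texts=[file.read()], collection=simba_docs)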
    if args.query:
        print("Consulting oracle ...")
        print(consult_oracle(
            input=args.query,
            collection=simba_docs,
        ))
    else:
        print("please provide a query")
--- a/petmd_scrape_index.py
+++ b/petmd_scrape_index.py
@@ -0,0 +1,24 @@
 from bs4 import BeautifulSoup
 import chromadb
 import httpx
 client = chromadb.PersistentClient(path="/Users/ryanchen/Programs/raggr/chromadb")
 # Scrape
 BASE_URL = "https://www.vet.cornell.edu"
 LIST_URL = "/departments-centers-and-institutes/cornell-feline-health-center/health-information/feline-health-topics"
 QUERY_URL = BASE_URL + LIST_URL
 r = httpx.get(QUERY_URL)
 soup = BeautifulSoup(r.text, "html.parser")
 container = soup.find("div", class_="field-body")
 a_s = container.find_all("a", href=True)
 new_texts = []
 for link in a_s:
    endpoint = link["href"]
    query_url = BASE_URL + endpoint
    r2 = httpx.get(query_url)
    article_soup = BeautifulSoup(r2.text, "html.parser")
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,4 +4,14 @@ version = "0.1.0"
 description = "Add your description here"
 readme = "README.md"
 requires-python = ">=3.13"
-dependencies = []
+dependencies = [
    "chromadb>=1.1.0",
    "python-dotenv>=1.0.0",
    "flask>=3.1.2",
    "httpx>=0.28.1",
    "ollama>=0.6.0",
    "openai>=2.0.1",
    "pydantic>=2.11.9",
    "pillow>=10.0.0",
    "pymupdf>=1.24.0",
 ]
--- a/query.py
+++ b/query.py
@@ -0,0 +1,141 @@
 import json
 import os
 from typing import Literal
 import datetime
 from ollama import chat, ChatResponse, Client
 from openai import OpenAI
 from pydantic import BaseModel, Field
 # Configure ollama client with URL from environment or default to localhost
 ollama_client = Client(host=os.getenv("OLLAMA_URL", "http://localhost:11434"))
 # This uses inferred filters — which means using LLM to create the metadata filters
 class FilterOperation(BaseModel):
    op: Literal["$gt", "$gte", "$eq", "$ne", "$lt", "$lte", "$in", "$nin"]
    value: str | list[str]
 class FilterQuery(BaseModel):
    field_name: Literal["created_date", "tags"]
    op: FilterOperation
 class AndQuery(BaseModel):
    op: Literal["$and", "$or"]
    subqueries: list[FilterQuery]
 class GeneratedQuery(BaseModel):
    fields: list[str]
    extracted_metadata_fields: str
 class Time(BaseModel):
    time: int
 PROMPT = """
 You are an information specialist that processes user queries. The current year is 2025. The user queries are all about
 a cat, Simba, and its records. The types of records are listed below. Using the query, extract
 the date range the user is trying to query. You should return it as JSON. The date tag is created_date. Return the date in epoch time.
 If the created_date cannot be ascertained, set it to epoch time start.
 You have several operators at your disposal:
 - $gt: greater than
 - $gte: greater than or equal
 - $eq: equal
 - $ne: not equal
 - $lt: less than
 - $lte: less than or equal to
 - $in: in
 - $nin: not in
 Logical operators:
 - $and, $or
 ### Example 1
 Query: "Who is Simba's current vet?"
 Metadata fields: {"created_date"}
 Extracted metadata fields: {"created_date": {"$gt": "2025-01-01"}}
 ### Example 2
 Query: "How many teeth has Simba had removed?"
 Metadata fields: {}
 Extracted metadata fields: {}
 ### Example 3
 Query: "How many times has Simba been to the vet this year?"
 Metadata fields: {"created_date"}
 Extracted metadata fields: {"created_date": {"$gt": "2025-01-01"}}
 document_types:
 - aftercare
 - bill
 - insurance claim
 - medical records
 Only return the extracted metadata fields. Make sure the extracted metadata fields are valid JSON
 """
 class QueryGenerator:
    def __init__(self) -> None:
        pass
    def date_to_epoch(self, date_str: str) -> float:
        split_date = date_str.split("-")
        date = datetime.datetime(
            int(split_date[0]),
            int(split_date[1]),
            int(split_date[2]),
            0,
            0,
            0,
        )
        return date.timestamp()
    def get_query(self, input: str):
        client = OpenAI()
        print(input)
        response = client.responses.parse(
            model="gpt-4o",
            input=[
                {"role": "system", "content": PROMPT},
                {"role": "user", "content": input},
            ],
            # Parse into GeneratedQuery, whose extracted_metadata_fields is read below
            text_format=GeneratedQuery,
        )
        print(response)
        query = json.loads(response.output_parsed.extracted_metadata_fields)
        # response: ChatResponse = ollama_client.chat(
            # model="gemma3n:e4b",
            # messages=[
                # {"role": "system", "content": PROMPT},
                # {"role": "user", "content": input},
            # ],
            # format=GeneratedQuery.model_json_schema(),
        # )
        # query = json.loads(
            # json.loads(response["message"]["content"])["extracted_metadata_fields"]
        # )
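        if "created_date" not in query:
            # No date constraint was extracted (as in Example 2 of PROMPT)
            return query
        date_key = list(query["created_date"].keys())[0]
        # Pop the original entry so an un-prefixed operator is not left behind
        epoch = self.date_to_epoch(query["created_date"].pop(date_key))
        if not date_key.startswith("$"):
            date_key = "$" + date_key
        query["created_date"][date_key] = epoch
        return query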
 if __name__ == "__main__":
    qg = QueryGenerator()
    print(qg.get_query("How heavy is Simba?"))
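The post-processing that `get_query` applies to the model's output (turning a `YYYY-MM-DD` string into epoch seconds and making sure the operator carries a `$` prefix) can be exercised without calling any LLM. The following is a minimal sketch under stated assumptions: `normalize_date_filter` is a hypothetical helper, not part of this commit, and it interprets dates as midnight UTC so the result is reproducible, whereas `date_to_epoch` above uses the local timezone.

```python
import datetime


def normalize_date_filter(query: dict) -> dict:
    """Convert {"created_date": {op: "YYYY-MM-DD"}} into epoch-second filters."""
    if "created_date" not in query:
        return query  # nothing to normalize
    normalized = {}
    for op, date_str in query["created_date"].items():
        year, month, day = (int(part) for part in date_str.split("-"))
        # Assumption: midnight UTC, so the output does not depend on the host timezone
        moment = datetime.datetime(year, month, day, tzinfo=datetime.timezone.utc)
        key = op if op.startswith("$") else "$" + op
        normalized[key] = moment.timestamp()
    return {**query, "created_date": normalized}


print(normalize_date_filter({"created_date": {"gt": "2025-01-01"}}))
# {'created_date': {'$gt': 1735689600.0}}
```

Returning a new dictionary rather than mutating the input keeps the un-prefixed key from lingering next to the prefixed one.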
--- a/raggr-frontend/.gitignore
+++ b/raggr-frontend/.gitignore
@@ -0,0 +1,16 @@
 # Local
 .DS_Store
 *.local
 *.log*
 # Dist
 node_modules
 dist/
 # Profile
 .rspack-profile-*/
 # IDE
 .vscode/*
 !.vscode/extensions.json
 .idea
--- a/raggr-frontend/README.md
+++ b/raggr-frontend/README.md
@@ -0,0 +1,36 @@
 # Rsbuild project
 ## Setup
 Install the dependencies:
 ```bash
 pnpm install
 ```
 ## Get started
 Start the dev server, and the app will be available at [http://localhost:3000](http://localhost:3000).
 ```bash
 pnpm dev
 ```
 Build the app for production:
 ```bash
 pnpm build
 ```
 Preview the production build locally:
 ```bash
 pnpm preview
 ```
 ## Learn more
 To learn more about Rsbuild, check out the following resources:
 - [Rsbuild documentation](https://rsbuild.rs) - explore Rsbuild features and APIs.
 - [Rsbuild GitHub repository](https://github.com/web-infra-dev/rsbuild) - your feedback and contributions are welcome!
--- a/raggr-frontend/package.json
+++ b/raggr-frontend/package.json
@@ -0,0 +1,26 @@
 {
  "name": "raggr-frontend",
  "version": "1.0.0",
  "private": true,
  "type": "module",
  "scripts": {
    "build": "rsbuild build",
    "dev": "rsbuild dev --open",
    "preview": "rsbuild preview"
  },
  "dependencies": {
    "axios": "^1.12.2",
    "marked": "^16.3.0",
    "react": "^19.1.1",
    "react-dom": "^19.1.1",
    "react-markdown": "^10.1.0"
  },
  "devDependencies": {
    "@rsbuild/core": "^1.5.6",
    "@rsbuild/plugin-react": "^1.4.0",
    "@tailwindcss/postcss": "^4.0.0",
    "@types/react": "^19.1.13",
    "@types/react-dom": "^19.1.9",
    "typescript": "^5.9.2"
  }
 }
--- a/raggr-frontend/postcss.config.mjs
+++ b/raggr-frontend/postcss.config.mjs
@@ -0,0 +1,5 @@
 export default {
 	plugins: {
 		"@tailwindcss/postcss": {},
 	},
 };
--- a/raggr-frontend/rsbuild.config.ts
+++ b/raggr-frontend/rsbuild.config.ts
@@ -0,0 +1,6 @@
 import { defineConfig } from '@rsbuild/core';
 import { pluginReact } from '@rsbuild/plugin-react';
 export default defineConfig({
  plugins: [pluginReact()],
 });
--- a/raggr-frontend/src/App.css
+++ b/raggr-frontend/src/App.css
@@ -0,0 +1,6 @@
@import "tailwindcss";
 body {
 	margin: 0;
 	font-family: Inter, Avenir, Helvetica, Arial, sans-serif;
 }
--- a/raggr-frontend/src/App.tsx
+++ b/raggr-frontend/src/App.tsx
@@ -0,0 +1,66 @@
 import { useState } from "react";
 import axios from "axios";
 import ReactMarkdown from "react-markdown";
 import "./App.css";
 const App = () => {
 	const [query, setQuery] = useState<string>("");
 	const [answer, setAnswer] = useState<string>("");
 	const [loading, setLoading] = useState<boolean>(false);
 	const handleQuestionSubmit = () => {
 		const payload = { query: query };
 		setLoading(true);
 		axios
 			.post("/api/query", payload)
 			.then((result) => setAnswer(result.data.response))
 			.finally(() => setLoading(false));
 	};
 	const handleQueryChange = (event: React.ChangeEvent<HTMLTextAreaElement>) => {
 		setQuery(event.target.value);
 	};
 	return (
 		<div className="flex flex-row justify-center py-4">
 			<div className="flex flex-col gap-4 min-w-xl max-w-xl">
 				<div className="flex flex-row justify-center gap-2 grow">
 					<h1 className="text-3xl">ask simba!</h1>
 				</div>
 				<div className="flex flex-row justify-between gap-2 grow">
 					<textarea
 						className="p-4 border border-blue-200 rounded-md grow"
 						onChange={handleQueryChange}
 					/>
 				</div>
 				<div className="flex flex-row justify-between gap-2 grow">
 					<button
 						className="p-4 border border-blue-400 bg-blue-200 hover:bg-blue-400 cursor-pointer rounded-md flex-grow"
 						onClick={() => handleQuestionSubmit()}
 						type="submit"
 					>
 						Submit
 					</button>
 				</div>
 				{loading ? (
 					<div className="flex flex-col w-full animate-pulse gap-2">
 						<div className="flex flex-row gap-2 w-full">
 							<div className="bg-gray-400 w-1/2 p-3 rounded-lg" />
 							<div className="bg-gray-400 w-1/2 p-3 rounded-lg" />
 						</div>
 						<div className="flex flex-row gap-2 w-full">
 							<div className="bg-gray-400 w-1/3 p-3 rounded-lg" />
 							<div className="bg-gray-400 w-2/3 p-3 rounded-lg" />
 						</div>
 					</div>
 				) : (
 					<div className="flex flex-col">
 						<ReactMarkdown>{answer}</ReactMarkdown>
 					</div>
 				)}
 			</div>
 		</div>
 	);
 };
 export default App;
--- a/raggr-frontend/src/env.d.ts
+++ b/raggr-frontend/src/env.d.ts
@@ -0,0 +1,11 @@
 /// <reference types="@rsbuild/core/types" />
 /**
 * Imports the SVG file as a React component.
 * @requires [@rsbuild/plugin-svgr](https://npmjs.com/package/@rsbuild/plugin-svgr)
 */
 declare module '*.svg?react' {
  import type React from 'react';
  const ReactComponent: React.FunctionComponent<React.SVGProps<SVGSVGElement>>;
  export default ReactComponent;
 }
--- a/raggr-frontend/src/index.tsx
+++ b/raggr-frontend/src/index.tsx
@@ -0,0 +1,13 @@
 import React from 'react';
 import ReactDOM from 'react-dom/client';
 import App from './App';
 const rootEl = document.getElementById('root');
 if (rootEl) {
  const root = ReactDOM.createRoot(rootEl);
  root.render(
    <React.StrictMode>
      <App />
    </React.StrictMode>,
  );
 }
--- a/raggr-frontend/tsconfig.json
+++ b/raggr-frontend/tsconfig.json
@@ -0,0 +1,25 @@
 {
  "compilerOptions": {
    "lib": ["DOM", "ES2020"],
    "jsx": "react-jsx",
    "target": "ES2020",
    "noEmit": true,
    "skipLibCheck": true,
    "useDefineForClassFields": true,
    /* modules */
    "module": "ESNext",
    "moduleDetection": "force",
    "moduleResolution": "bundler",
    "verbatimModuleSyntax": true,
    "resolveJsonModule": true,
    "allowImportingTsExtensions": true,
    "noUncheckedSideEffectImports": true,
    /* type checking */
    "strict": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true
  },
  "include": ["src"]
 }
--- a/raggr-frontend/yarn.lock
+++ b/raggr-frontend/yarn.lock
--- a/request.py
+++ b/request.py
@@ -1,4 +1,5 @@
 import os
 import tempfile
 import httpx
 from dotenv import load_dotenv
@@ -18,7 +19,31 @@ class PaperlessNGXService:
        r = httpx.get(self.url, headers=self.headers)
        return r.json()["results"]
    def get_doc_by_id(self, doc_id: int):
        url = f"http://{os.getenv("BASE_URL")}/api/documents/{doc_id}/"
        r = httpx.get(url, headers=self.headers)
        return r.json()
    def download_pdf_from_id(self, id: int) -> str:
        download_url = f"http://{os.getenv("BASE_URL")}/api/documents/{id}/download/"
        response = httpx.get(
            download_url, headers=self.headers, follow_redirects=True, timeout=30
        )
        response.raise_for_status()
        # Write the PDF to a temporary file; delete=False keeps it on disk for the caller
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
            temp_file.write(response.content)
        return temp_file.name
    def upload_cleaned_content(self, document_id, data):
        PUTS_URL = f"http://{os.getenv("BASE_URL")}/api/documents/{document_id}/"
        r = httpx.put(PUTS_URL, headers=self.headers, data=data)
        r.raise_for_status()
 if __name__ == "__main__":
    pp = PaperlessNGXService()
-    print(pp.get_data()[0].keys())
+    pp.get_data()
--- a/startup.sh
+++ b/startup.sh
@@ -0,0 +1,7 @@
 #!/bin/bash
 echo "Starting reindex process..."
 python main.py "" --reindex
 echo "Starting Flask application..."
 python app.py
--- a/uv.lock
+++ b/uv.lock
Author	SHA1	Message	Date
Ryan Chen	3ffc95a1b0	Switch to OpenAI embeddings for ChromaDB Replace Ollama embedding function with OpenAI's text-embedding-3-small model for improved embedding quality and consistency. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-02 21:05:17 -04:00
Ryan Chen	c5091dc07a	Configure Docker for Linux host networking and add startup reindex - Switch to host network mode for direct access to Ollama on host - Update OLLAMA_URL to use localhost:11434 - Add startup.sh script to trigger reindex before app starts - Update Dockerfile to execute startup script 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-02 21:02:55 -04:00
Ryan Chen	c140758560	asfd	2025-10-02 20:57:19 -04:00
Ryan Chen	ab3a0eb442	Reorganize Dockerfile to copy application code before frontend build Move Python application code copy before frontend build step to improve Dockerfile organization and ensure all app code is available earlier. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-02 20:48:52 -04:00
Ryan Chen	c619d78922	Adding axios	2025-10-02 20:46:10 -04:00
Ryan Chen	c20ae0a4b9	Add missing @tailwindcss/postcss dependency to frontend Fix Docker build failure by adding @tailwindcss/postcss package required by postcss.config.mjs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-02 20:44:49 -04:00
Ryan Chen	26cc01b58b	Add frontend build step to Dockerfile Install Node.js and Yarn, then build the raggr-frontend during Docker image build process. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-02 20:42:01 -04:00
Ryan Chen	746b60e070	Switch to using torrtle/simbarag:latest Docker image Replace local build with pre-built image from Docker Hub 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-02 20:39:36 -04:00
Ryan Chen	577c9144ac	Switch Dockerfile to use uv for dependency management - Install uv via official installer script - Replace pip with uv pip install --system - Add uv to PATH for container usage 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-02 20:36:45 -04:00
Ryan Chen	2b2891bd79	Fix and add missing dependencies to pyproject.toml - Fix dotenv package name to python-dotenv - Add pillow for image processing - Add pymupdf for PDF handling 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-02 20:34:59 -04:00
Ryan Chen	03b033e9a4	Configure ollama to use external host instead of docker service - Update all ollama clients to use configurable OLLAMA_URL environment variable - Remove ollama service from docker-compose.yml to use external ollama instance - Configure docker-compose to connect to host ollama via 172.17.0.1:11434 (Linux) or host.docker.internal (macOS/Windows) - Add cross-platform compatibility with extra_hosts mapping - Update embedding function fallback URL for consistency 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-02 20:29:48 -04:00
Ryan Chen	a640ae5fed	Docker stuff	2025-10-02 20:21:48 -04:00
Ryan Chen	99c98b7e42	yeet	2025-10-02 19:21:24 -04:00
ryan	a69f7864f3	Merge pull request 'yeat' (#3) from rc/9-metadata-date-filtering into main Reviewed-on: #3	2025-08-07 17:43:59 -04:00
Ryan Chen	679cfb08e4	yeat	2025-08-07 17:43:24 -04:00
ryan	fc504d3e9c	Merge pull request 'Adding some funny stuff' (#2) from data-preprocessing into main Reviewed-on: #2 implements #1	2025-07-30 20:30:34 -04:00
Ryan Chen	c7152d3f32	Moving chromadb to env var	2025-07-30 20:27:03 -04:00
Ryan Chen	0a88a03c90	Expanded context window, CLI'd the app, and added preprocessing	2025-07-30 19:58:29 -04:00
Ryan Chen	b43ef63449	Adding some funny stuff	2025-07-29 22:59:40 -04:00
ryan	b698109183	Merge pull request 'Adding more embeddings' (#1) from better-embeddings into main Reviewed-on: #1	2025-07-26 19:55:31 -04:00