Ryan Chen 126b53f17d docs(1): research phase domain
Phase 1: Foundation
- Standard stack identified (aioimaplib, email, cryptography, Quart-Tasks)
- Architecture patterns documented (encrypted fields, IMAP lifecycle, parsing)
- Pitfalls catalogued (connection limits, encoding, key loss, sync state, IDLE, HTML bloat)
2026-02-07 13:31:42 -05:00


# Phase 1: Foundation - Research
**Researched:** 2026-02-07
**Domain:** Email ingestion infrastructure (IMAP, parsing, encryption, database)
**Confidence:** HIGH
## Summary
Phase 1 establishes the core infrastructure for IMAP email ingestion. The standard Python async stack provides mature, well-documented solutions for all required components. The research confirms that:
1. **aioimaplib** (v2.0.1, Jan 2025) is the production-ready async IMAP client for Python 3.9+
2. Python's built-in **email** module handles multipart message parsing without additional dependencies
3. **cryptography** library's Fernet provides secure credential encryption with established patterns
4. **Tortoise ORM** custom fields enable transparent encryption/decryption at the database layer
5. **Quart-Tasks** integrates scheduled IMAP sync directly into the existing Quart application
The codebase already demonstrates the required patterns: Tortoise ORM models with foreign keys (conversations/messages), ChromaDB collection management (simba_docs2, feline_vet_lookup), and async Quart blueprints with JWT/admin decorators.
**Primary recommendation:** Build three Tortoise ORM models (EmailAccount, EmailSyncStatus, Email) with encrypted credentials field, use aioimaplib for IMAP operations, Python email module for parsing, and Quart-Tasks for scheduling. Reuse existing admin_required decorator pattern and ChromaDB collection approach.
## Standard Stack
### Core
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| aioimaplib | 2.0.1 (Jan 2025) | Async IMAP4rev1 client | Only mature async IMAP library; tested against Python 3.9-3.12; no runtime dependencies; RFC2177 IDLE support |
| email (stdlib) | 3.14+ | Email parsing (multipart, headers) | Built-in; official standard for email parsing; modern EmailMessage API with get_body() |
| cryptography | 46.0.4 (Jan 2026) | Fernet symmetric encryption | Industry standard; widely audited; MultiFernet for key rotation; Python 3.8+ support |
| tortoise-orm | 0.25.4 | ORM with custom fields | Already in use; custom field support via to_db_value/to_python_value |
| quart-tasks | Latest | Scheduled background tasks | Designed for Quart; async-native; cron and periodic scheduling |
### Supporting
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| html2text | 2025.4.15 | HTML to plain text | When email body is HTML-only; converts to readable text |
| beautifulsoup4 | Latest | HTML parsing fallback | When html2text fails; more control over extraction |
| asyncio (stdlib) | 3.14+ | Async operations | IMAP connection management, timeout handling |
### Alternatives Considered
| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| aioimaplib | imaplib (stdlib sync) | imaplib is blocking; would require thread pools; no IDLE support; not Quart-compatible |
| aioimaplib | pymap | pymap is a server library, not client; wrong use case |
| Fernet | bcrypt | bcrypt is one-way hashing for passwords; Fernet is reversible encryption for credentials |
| Quart-Tasks | APScheduler AsyncIOScheduler | APScheduler adds dependency; Quart-Tasks is tighter integration; cron syntax compatible |
| email module | mail-parser | mail-parser adds dependency; stdlib sufficient for standard emails; overhead not justified |
**Installation:**
```bash
# Core dependencies (add to pyproject.toml)
pip install aioimaplib cryptography quart-tasks
# Optional HTML parsing
pip install html2text beautifulsoup4
```
## Architecture Patterns
### Recommended Project Structure
```
blueprints/
├── email/ # New email blueprint
│ ├── __init__.py # Routes (admin-only, follows existing pattern)
│ ├── models.py # EmailAccount, EmailSyncStatus, Email
│ ├── imap_service.py # IMAP connection utility
│ ├── parser_service.py # Email body parsing
│ └── crypto_service.py # Credential encryption utility
utils/
├── email_chunker.py # Email-specific chunking (reuse Chunker pattern)
```
### Pattern 1: Encrypted Tortoise ORM Field
**What:** Custom field that transparently encrypts on write and decrypts on read
**When to use:** Storing reversible sensitive data (IMAP passwords, tokens)
**Example:**
```python
# Source: https://tortoise.github.io/fields.html + https://cryptography.io/en/latest/fernet/
import os

from cryptography.fernet import Fernet
from tortoise import fields
from tortoise.models import Model


class EncryptedTextField(fields.TextField):
    """Transparently encrypts/decrypts a text field using Fernet."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Key from environment variable (32 bytes, URL-safe base64-encoded)
        key = os.getenv("FERNET_KEY")
        if not key:
            raise ValueError("FERNET_KEY environment variable required")
        self.fernet = Fernet(key.encode())

    def to_db_value(self, value: str, instance) -> str:
        """Encrypt before storing in the database."""
        if value is None:
            return None
        # Returns a Fernet token (URL-safe base64 string)
        return self.fernet.encrypt(value.encode()).decode()

    def to_python_value(self, value: str) -> str:
        """Decrypt when loading from the database."""
        if value is None:
            return None
        return self.fernet.decrypt(value.encode()).decode()


# Usage in model (field name matches the EmailAccount model in Example 3)
class EmailAccount(Model):
    imap_password = EncryptedTextField()  # Transparent encryption
```
### Pattern 2: IMAP Connection Lifecycle
**What:** Async context manager for IMAP connections with proper cleanup
**When to use:** All IMAP operations (fetch, list folders, sync)
**Example:**
```python
# Source: https://github.com/bamthomas/aioimaplib README
from aioimaplib import IMAP4_SSL


class IMAPService:
    async def connect(self, host: str, user: str, password: str):
        """
        Establish IMAP connection with proper lifecycle.

        CRITICAL: Must call logout() to close TCP connection.
        close() only closes mailbox, not connection.
        """
        imap = IMAP4_SSL(host=host)
        await imap.wait_hello_from_server()
        try:
            response = await imap.login(user, password)
            # A rejected login comes back as a non-OK response, not an exception
            if response.result != "OK":
                raise ConnectionError(f"IMAP login failed: {response.lines}")
            return imap
        except Exception:
            await imap.logout()  # Clean up on login failure
            raise

    async def list_folders(self, imap):
        """List all mailbox folders"""
        # Each LIST line looks like: (\HasNoChildren) "/" "INBOX"
        response = await imap.list('""', '*')
        # _parse_list_response is a helper still to be written (not shown here)
        return self._parse_list_response(response)

    async def fetch_messages(self, imap, folder="INBOX", limit=100):
        """Fetch recent messages from folder"""
        await imap.select(folder)
        # Search for all messages
        response = await imap.search('ALL')
        message_ids = response.lines[0].split()
        # Fetch last N messages
        recent_ids = message_ids[-limit:]
        messages = []
        for msg_id in recent_ids:
            # Response lines are bytes; commands take str arguments.
            # FETCH returns the full RFC822 message.
            msg_data = await imap.fetch(msg_id.decode(), '(RFC822)')
            messages.append(msg_data)
        return messages

    async def close(self, imap):
        """Properly close IMAP connection"""
        try:
            await imap.logout()  # Closes TCP connection
        except Exception:
            pass  # Best effort cleanup


# Usage: try/finally guarantees logout even when the sync fails
async def sync_emails(account: EmailAccount):
    service = IMAPService()
    imap = await service.connect(
        account.imap_host,
        account.imap_username,
        account.imap_password,  # Auto-decrypted by EncryptedTextField
    )
    try:
        messages = await service.fetch_messages(imap)
        # Process messages...
    finally:
        await service.close(imap)
```
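The `_parse_list_response` helper referenced in Pattern 2 is not defined in this document. A minimal sketch of what it could look like follows, assuming the response lines arrive as bytes in the `(flags) "delimiter" name` shape shown in the LIST comment. The function name, the returned dict keys, and the regex are illustrative choices, and mailbox names in modified UTF-7 are not decoded here.

```python
import re

# One LIST line: (flags) "delimiter" mailbox-name
_LIST_LINE = re.compile(
    rb'\((?P<flags>[^)]*)\)\s+(?P<delim>"[^"]*"|NIL)\s+(?P<name>.+)'
)


def parse_list_response(lines):
    """Return a list of {'name', 'delimiter', 'flags'} dicts from LIST lines."""
    folders = []
    for line in lines:
        match = _LIST_LINE.match(bytes(line))
        if not match:
            continue  # Skip status lines such as "LIST completed."
        delimiter = match.group("delim").decode()
        folders.append({
            "name": match.group("name").decode().strip().strip('"'),
            "delimiter": None if delimiter == "NIL" else delimiter.strip('"'),
            "flags": match.group("flags").decode().split(),
        })
    return folders
```

Called as `parse_list_response(response.lines)`, this would yield one dict per folder, which makes it easy to skip folders flagged `\Noselect` before issuing SELECT.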
### Pattern 3: Email Body Parsing (Multipart/Alternative)
**What:** Extract plain text and HTML bodies from multipart messages
**When to use:** Processing all incoming emails
**Example:**
```python
# Source: https://docs.python.org/3/library/email.message.html
from email import message_from_bytes
from email.policy import default


def parse_email_body(raw_email_bytes: bytes) -> dict:
    """
    Extract text and HTML bodies plus basic headers from an email.

    Returns a dict with keys: text, html, preferred, subject, from, to,
    date, message_id.
    """
    # Parse with modern EmailMessage API
    msg = message_from_bytes(raw_email_bytes, policy=default)
    result = {"text": None, "html": None, "preferred": None}

    # Try to get plain text body
    text_part = msg.get_body(preferencelist=('plain',))
    if text_part:
        result["text"] = text_part.get_content()

    # Try to get HTML body
    html_part = msg.get_body(preferencelist=('html',))
    if html_part:
        result["html"] = html_part.get_content()

    # Determine preferred version (plain text preferred for RAG)
    if result["text"]:
        result["preferred"] = result["text"]
    elif result["html"]:
        # Convert HTML to text if no plain text version.
        # Imported lazily because html2text is an optional dependency.
        import html2text
        h = html2text.HTML2Text()
        h.ignore_links = False
        result["preferred"] = h.handle(result["html"])

    # Extract metadata
    result["subject"] = msg.get("subject", "")
    result["from"] = msg.get("from", "")
    result["to"] = msg.get("to", "")
    result["date"] = msg.get("date", "")
    result["message_id"] = msg.get("message-id", "")
    return result
```
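To see how `get_body()` behaves on a multipart/alternative message without an IMAP server, a test message can be built with the standard library. This is a small sketch for experimentation; the addresses, subject, and function names are made up for the example.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default


def build_sample_message() -> bytes:
    """Build a multipart/alternative message with plain and HTML parts."""
    msg = EmailMessage()
    msg["Subject"] = "Vet appointment"
    msg["From"] = "clinic@example.com"
    msg["To"] = "owner@example.com"
    msg.set_content("Plain text body")
    # Adding an alternative converts the message to multipart/alternative
    msg.add_alternative("<p>HTML body</p>", subtype="html")
    return msg.as_bytes()


def extract_bodies(raw: bytes) -> dict:
    """Pull each alternative out by content subtype."""
    msg = message_from_bytes(raw, policy=default)
    text_part = msg.get_body(preferencelist=("plain",))
    html_part = msg.get_body(preferencelist=("html",))
    return {
        "text": text_part.get_content() if text_part else None,
        "html": html_part.get_content() if html_part else None,
        "is_multipart": msg.is_multipart(),
    }
```

Round-tripping through `as_bytes()` and `message_from_bytes()` mirrors what arrives from an IMAP FETCH, so the same fixture can feed unit tests for the parser.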
### Pattern 4: Scheduled Email Sync with Quart-Tasks
**What:** Background task that syncs emails periodically
**When to use:** Production deployment with regular sync intervals
**Example:**
```python
# Source: https://github.com/pgjones/quart-tasks
from datetime import timedelta

from quart import Quart
from quart_tasks import QuartTasks

app = Quart(__name__)
tasks = QuartTasks(app)


@tasks.cron("0 */2 * * *")  # Every 2 hours at :00
async def scheduled_email_sync():
    """
    Sync emails from all active accounts.
    Runs every 2 hours. Cron format: minute hour day month weekday
    """
    from blueprints.email.models import EmailAccount

    accounts = await EmailAccount.filter(is_active=True).all()
    for account in accounts:
        try:
            # sync_account_emails is defined in Example 1
            await sync_account_emails(account)
        except Exception as e:
            # Log but continue with other accounts
            app.logger.error(f"Sync failed for {account.email_address}: {e}")


# Alternative: periodic scheduling (register one of the two, not both)
@tasks.periodic(timedelta(hours=2))
async def periodic_email_sync():
    """Same as above but using timedelta"""
    pass


# Manual trigger via CLI
# quart invoke-task scheduled_email_sync
```
### Pattern 5: ChromaDB Email Collection
**What:** Separate collection for email embeddings with metadata
**When to use:** All email indexing operations
**Example:**
```python
# Source: Existing main.py patterns
import os

import chromadb

from blueprints.email.models import Email
from utils.chunker import Chunker

# Initialize ChromaDB (reuse existing client pattern)
client = chromadb.PersistentClient(path=os.getenv("CHROMADB_PATH", ""))

# Create email collection (similar to simba_docs2, feline_vet_lookup)
email_collection = client.get_or_create_collection(
    name="email_messages",
    metadata={"description": "Email message embeddings for RAG"}
)


# Add email with metadata
async def index_email(email: Email):
    """Index single email into ChromaDB"""
    chunker = Chunker(email_collection)

    # Prepare text (body + subject for context)
    text = f"Subject: {email.subject}\n\n{email.body_text}"

    # Metadata for filtering
    metadata = {
        "email_id": str(email.id),
        "from_address": email.from_address,
        "to_address": email.to_address,
        "subject": email.subject,
        "date": email.date.timestamp(),
        "account_id": str(email.account_id),
        "message_id": email.message_id,
    }

    # Chunk and embed (reuses existing pattern)
    chunker.chunk_document(
        document=text,
        metadata=metadata,
        chunk_size=1000
    )
```
### Anti-Patterns to Avoid
- **Don't use IMAP4.close() to disconnect**: It only closes the mailbox, not TCP connection. Always use logout()
- **Don't store encryption keys in code**: Use environment variables and proper key management
- **Don't share IMAP connections across async tasks**: Each task needs its own connection (an IMAP session is stateful, so concurrent tasks on one connection conflict over the selected mailbox)
- **Don't fetch all messages on every sync**: Track last sync timestamp and fetch incrementally
- **Don't parse HTML with regex**: Use html2text or BeautifulSoup for proper parsing
- **Don't store plaintext passwords**: Always use EncryptedTextField for credentials
## Don't Hand-Roll
Problems that look simple but have existing solutions:
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| IMAP protocol | Custom socket code | aioimaplib | IMAP has complex state machine, authentication flows (OAUTH2), IDLE support, error handling |
| Email parsing | String splitting / regex | email (stdlib) | MIME multipart is complex; nested parts; encoding issues; attachment handling |
| Credential encryption | Custom XOR / Caesar cipher | cryptography.fernet | Fernet provides authenticated encryption (AES + HMAC); timestamped tokens with optional TTL check; key rotation via MultiFernet |
| HTML to text | Regex strip tags | html2text | Preserves structure; handles entities; converts to markdown; handles nested tags |
| Scheduled tasks | while True + asyncio.sleep | Quart-Tasks | Cron syntax; error handling; graceful shutdown; CLI integration; no drift |
| Email deduplication | Compare body text | message-id header | RFC-compliant unique identifier; handles threading; forwards detection |
**Key insight:** Email handling involves decades of RFC specifications (RFC 3501 IMAP, RFC 2822 message format and its successor RFC 5322, RFC 2047 header encoding, RFC 6154 special-use folders). Standard libraries internalize this complexity.
## Common Pitfalls
### Pitfall 1: IMAP Connection Limits
**What goes wrong:** Provider terminates connections with "Too many connections" error. Gmail limits 15 concurrent connections per account, Yahoo limits 5.
**Why it happens:**
- Each IMAP connection is counted against account quota
- Connections not properly closed leak quota
- Multiple sync tasks create concurrent connections
- Provider counts connections across all devices
**How to avoid:**
- Use connection pooling with max_connections limit
- Set connection timeout to 10 seconds (detect dead connections)
- Always call logout() in finally block
- Implement exponential backoff on connection errors
- Track active connections per account
**Warning signs:**
- Intermittent "Connection refused" errors
- Sync works initially then fails
- Errors after deploying multiple instances
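
The exponential backoff recommended above can be sketched with the standard library alone. The function names, the retry count, and the 60-second cap below are illustrative assumptions rather than values taken from any provider's documentation; `connect` stands for any async callable that opens and authenticates a connection.

```python
import asyncio
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retry number `attempt` (0-based), capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


async def connect_with_retry(connect, max_attempts: int = 5,
                             base: float = 1.0, jitter: float = 0.1):
    """Await `connect()`, retrying connection errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return await connect()
        except (ConnectionError, asyncio.TimeoutError, OSError):
            if attempt == max_attempts - 1:
                raise  # Out of retries: surface the last error
            delay = backoff_delay(attempt, base)
            # Small random jitter keeps several workers from retrying in lockstep
            await asyncio.sleep(delay + random.uniform(0, jitter * delay))
```

Doubling the wait after each failure gives the provider time to release leaked connections instead of adding to the quota pressure with immediate reconnects.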
### Pitfall 2: Message Encoding Hell
**What goes wrong:** Emails display as garbled characters (<28>) or wrong language characters.
**Why it happens:**
- Email headers/body can be in various encodings (UTF-8, ISO-8859-1, Windows-1252)
- RFC 2047 encoded-words in headers (`=?UTF-8?B?...?=`)
- Base64 or quoted-printable transfer encoding
- Charset mismatch between declaration and actual content
**How to avoid:**
- Use email.policy.default (handles encoding automatically)
- Call get_content() not get_payload() (modern API does decoding)
- Catch UnicodeDecodeError and try common fallback encodings
- Log original encoding for debugging
**Warning signs:**
- Subject lines with `=?UTF-8?` visible in output
- Asian/emoji characters showing as `?` or boxes
- Stack traces with UnicodeDecodeError
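
The fallback-encoding advice above might look like the following sketch. The charset order and helper names are assumptions for illustration; `email.policy.default` already decodes RFC 2047 encoded-words in headers, so the manual fallback is only meant for raw payloads whose declared charset is missing or wrong.

```python
from email import message_from_bytes
from email.policy import default
from typing import Optional

# Tried in order after the declared charset; iso-8859-1 accepts any byte
FALLBACK_CHARSETS = ("utf-8", "windows-1252", "iso-8859-1")


def decode_with_fallback(payload: bytes, declared: Optional[str] = None) -> str:
    """Decode bytes using the declared charset first, then common fallbacks."""
    candidates = ([declared] if declared else []) + list(FALLBACK_CHARSETS)
    for charset in candidates:
        try:
            return payload.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue  # Wrong or unknown charset: try the next one
    # Defensive default; normally unreachable because iso-8859-1 never fails
    return payload.decode("utf-8", errors="replace")


def decoded_subject(raw: bytes) -> str:
    """Return the Subject header with RFC 2047 encoded-words decoded."""
    msg = message_from_bytes(raw, policy=default)
    return str(msg["subject"])
```

Logging which charset finally succeeded, next to the declared one, covers the "log original encoding" recommendation.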
### Pitfall 3: Fernet Key Loss = Data Loss
**What goes wrong:** Application starts but can't decrypt existing credentials. All IMAP accounts become inaccessible.
**Why it happens:**
- FERNET_KEY environment variable changed or missing
- Database migrated without bringing encryption key
- Key rotation done incorrectly (dropped old key while data still encrypted)
- Development vs production key mismatch
**How to avoid:**
- Document FERNET_KEY as required in .env.example
- Add startup validation: decrypt test value or fail fast
- Use MultiFernet for key rotation (keeps old key for decryption)
- Back up encryption key separately from database
- Test database restore process includes key
**Warning signs:**
- cryptography.fernet.InvalidToken exceptions on account.password access
- Cannot authenticate to IMAP after deployment
- Error: "Fernet key must be 32 url-safe base64-encoded bytes"
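
Key rotation with MultiFernet, as recommended above, can be sketched as follows. The helper names are hypothetical; the behaviour relied on is that `MultiFernet` encrypts with the first key in the list and tries every key when decrypting, and that `rotate()` re-encrypts a token under the first key.

```python
from cryptography.fernet import Fernet, MultiFernet


def build_cipher(keys: list) -> MultiFernet:
    """Build a cipher from keys ordered newest first.

    The first key encrypts; all keys are tried for decryption.
    """
    return MultiFernet([Fernet(key) for key in keys])


def rotate_token(cipher: MultiFernet, token: bytes) -> bytes:
    """Re-encrypt an existing token under the cipher's first (newest) key."""
    return cipher.rotate(token)
```

During a rotation the old key stays in the list until every stored credential has been passed through `rotate_token` and saved; only then is it safe to drop.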
### Pitfall 4: Not Tracking Sync State
**What goes wrong:** Re-downloads thousands of emails on every sync. Database fills with duplicates. API rate limits hit.
**Why it happens:**
- No tracking of last synced message
- Using IMAP SEARCH ALL instead of SINCE date
- Not using message-id for deduplication
- Sync status not persisted across restarts
**How to avoid:**
- EmailSyncStatus table tracks last_sync_date, last_message_uid per account
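- Use IMAP UIDs for reliable message tracking (UIDs stay valid only while the folder's UIDVALIDITY is unchanged; resync the folder if it changes)
- Use SEARCH SINCE <date> to fetch only new messages (SINCE has day granularity, so expect overlap and deduplicate)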
- Check message-id before inserting (ON CONFLICT DO NOTHING)
- Update sync status atomically with message insert
**Warning signs:**
- Sync time increases linearly with mailbox age
- Database size grows faster than email volume
- Duplicate emails in search results
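
The incremental-fetch bookkeeping above comes down to a few pure functions, sketched here with hypothetical names. Two IMAP details motivate them: SEARCH dates use the `07-Feb-2026` form, and a UID range ending in `*` always matches the newest message even when its UID is below the range start, so the result still needs filtering.

```python
from datetime import date

_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def imap_date(d: date) -> str:
    """Format a date for SEARCH SINCE, independent of the process locale."""
    return f"{d.day:02d}-{_MONTHS[d.month - 1]}-{d.year}"


def uid_range_criterion(last_uid: int) -> str:
    """SEARCH criterion for messages after the last synced UID."""
    return f"UID {last_uid + 1}:*"


def new_uids(search_line: bytes, last_uid: int) -> list:
    """Parse a SEARCH result line and drop UIDs that were already synced."""
    return sorted(uid for uid in map(int, search_line.split()) if uid > last_uid)
```

Persisting the largest value returned by `new_uids` as `last_message_uid` keeps the next sync from re-fetching the same message.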
### Pitfall 5: IMAP IDLE Hanging Forever
**What goes wrong:** IMAP sync task never completes. Application appears frozen. No new emails processed.
**Why it happens:**
- IDLE command waits indefinitely for new mail
- Network timeout disconnects but code doesn't detect
- Provider drops connection after 30 minutes (standard timeout)
- No timeout set on wait_server_push()
**How to avoid:**
- Don't use IDLE for scheduled sync (use SEARCH instead)
- If using IDLE, set timeout: `await imap.wait_server_push(timeout=600)`
- Implement connection health checks (NOOP command)
- Handle asyncio.TimeoutError and reconnect
- Use IDLE only for real-time notifications (out of scope for Phase 1)
**Warning signs:**
- Scheduled sync never completes
- No logs after "IDLE command sent"
- Task shows running but no activity
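
The timeout handling above can be wrapped in a small helper built on `asyncio.wait_for`. This is a generic sketch: `awaitable` stands in for whatever IMAP call might hang, and returning `None` on timeout is an illustrative convention, not aioimaplib behaviour.

```python
import asyncio


async def wait_with_timeout(awaitable, timeout: float):
    """Return the awaitable's result, or None if it does not finish in time."""
    try:
        return await asyncio.wait_for(awaitable, timeout=timeout)
    except asyncio.TimeoutError:
        # The pending operation is cancelled; the caller should reconnect
        return None
```

A `None` result is the signal to log out, reconnect, and resume, rather than leaving the task parked on a dead socket.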
### Pitfall 6: HTML Email Bloat in Embeddings
**What goes wrong:** Email embeddings are poor quality. Search returns irrelevant results. ChromaDB storage explodes.
**Why it happens:**
- Storing raw HTML with tags/styles in vectors
- Email signatures with base64 images embedded
- Marketing emails with 90% HTML boilerplate
- Script tags, CSS, tracking pixels in body
**How to avoid:**
- Always convert HTML to plain text before indexing
- Strip email signatures (common patterns: "-- " divider, "Sent from my iPhone")
- Remove quoted reply text ("> " prefix detection)
- Limit chunk size to exclude metadata bloat
- Prefer plain text body over HTML when both available
**Warning signs:**
- Email search returns marketing emails for every query
- Embeddings contain HTML tag tokens
- Chunk count per email far higher than for comparable plain-text documents (the embedding dimension itself is fixed by the model)
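
The signature and quoted-reply stripping above could start from a line-based filter like this one. The patterns are deliberately simple heuristics chosen for illustration ("-- " delimiter, "Sent from my ..." footers, "> " quotes, "On ... wrote:" headers) and will miss many real-world variants.

```python
import re

_SIGNATURE_DELIMITER = re.compile(r"^-- ?$")
_MOBILE_FOOTER = re.compile(r"^Sent from my \w+", re.IGNORECASE)
_REPLY_HEADER = re.compile(r"^On .+ wrote:$")


def clean_body(text: str) -> str:
    """Drop quoted replies and cut the body at the first signature marker."""
    kept = []
    for line in text.splitlines():
        if _SIGNATURE_DELIMITER.match(line) or _MOBILE_FOOTER.match(line):
            break  # Everything after this point is treated as signature
        if line.lstrip().startswith(">") or _REPLY_HEADER.match(line.strip()):
            continue  # Quoted reply text or its attribution line
        kept.append(line.rstrip())
    return "\n".join(kept).strip()
```

Running this on the plain-text body before chunking keeps repeated reply chains from dominating the embeddings of a long thread.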
## Code Examples
Verified patterns from official sources:
### Example 1: Complete IMAP Sync Flow
```python
# Source: Composite of aioimaplib + email module patterns
from datetime import datetime
from email import message_from_bytes
from email.policy import default
from email.utils import parsedate_to_datetime

from aioimaplib import IMAP4_SSL


async def sync_account_emails(account: EmailAccount):
    """
    Complete sync flow: connect, fetch, parse, store.
    """
    # 1. Establish connection
    imap = IMAP4_SSL(host=account.imap_host, port=account.imap_port, timeout=10)
    await imap.wait_hello_from_server()
    try:
        # 2. Authenticate
        await imap.login(account.imap_username, account.imap_password)

        # 3. Select INBOX
        await imap.select('INBOX')

        # 4. Get last sync status
        sync_status = await EmailSyncStatus.get_or_none(account=account)
        last_uid = sync_status.last_message_uid if sync_status else 0

        # 5. Search for new messages (UID > last_uid).
        # uid_search() issues UID SEARCH. A range ending in "*" always matches
        # the newest message, so already-synced UIDs are filtered out here.
        response = await imap.uid_search(f'UID {last_uid + 1}:*')
        message_uids = [
            int(uid) for uid in response.lines[0].split() if int(uid) > last_uid
        ]

        # 6. Fetch and process each message
        for uid in message_uids:
            # Fetch full message
            fetch_result = await imap.uid('fetch', str(uid), '(RFC822)')
            raw_email = fetch_result.lines[1]  # Email bytes

            # Parse email
            msg = message_from_bytes(raw_email, policy=default)

            # Extract components (message_id is passed separately as the lookup key)
            message_id = msg.get('message-id')
            email_data = {
                'account': account,
                'subject': msg.get('subject', ''),
                'from_address': msg.get('from', ''),
                'to_address': msg.get('to', ''),
                'date': parsedate_to_datetime(msg.get('date')),
                'body_text': None,
                'body_html': None,
            }

            # Get body content
            text_part = msg.get_body(preferencelist=('plain',))
            if text_part:
                email_data['body_text'] = text_part.get_content()
            html_part = msg.get_body(preferencelist=('html',))
            if html_part:
                email_data['body_html'] = html_part.get_content()

            # 7. Store in database (check for duplicates)
            email_obj, created = await Email.get_or_create(
                message_id=message_id,
                defaults=email_data
            )

            # 8. Index in ChromaDB if new
            if created:
                await index_email(email_obj)

        # 9. Update sync status
        await EmailSyncStatus.update_or_create(
            account=account,
            defaults={
                'last_sync_date': datetime.now(),
                'last_message_uid': message_uids[-1] if message_uids else last_uid,
                'message_count': len(message_uids),
            }
        )
    finally:
        # 10. Always logout
        await imap.logout()
```
### Example 2: Fernet Key Generation and Setup
```python
# Source: https://cryptography.io/en/latest/fernet/
import os

from cryptography.fernet import Fernet


# One-time setup: Generate key
def generate_fernet_key():
    """
    Generate new Fernet encryption key.

    CRITICAL: Store this in environment variable.
    If lost, encrypted data cannot be recovered.
    """
    key = Fernet.generate_key()
    print("Add to .env file:")
    print(f"FERNET_KEY={key.decode()}")
    return key


# Add to .env.example
"""
# Email Encryption Key (32-byte URL-safe base64)
# Generate with: python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
FERNET_KEY=your-fernet-key-here
"""


# Startup validation
def validate_fernet_key():
    """Validate encryption key on app startup"""
    key = os.getenv("FERNET_KEY")
    if not key:
        raise ValueError("FERNET_KEY environment variable required")
    try:
        f = Fernet(key.encode())
        # Test encrypt/decrypt round trip
        test = f.encrypt(b"test")
        f.decrypt(test)
    except Exception as e:
        raise ValueError(f"Invalid FERNET_KEY: {e}")
```
### Example 3: Email Models with Encryption
```python
# Source: Tortoise ORM patterns from existing codebase
from datetime import datetime, timedelta

from tortoise import fields
from tortoise.models import Model

# EncryptedTextField is the custom field defined in Pattern 1


class EmailAccount(Model):
    """
    Email account configuration.
    Multiple accounts supported (personal, work, etc.)
    """
    id = fields.UUIDField(primary_key=True)
    user = fields.ForeignKeyField('models.User', related_name='email_accounts')

    # Account info
    email_address = fields.CharField(max_length=255, unique=True)
    display_name = fields.CharField(max_length=255, null=True)

    # IMAP settings
    imap_host = fields.CharField(max_length=255)  # e.g., imap.gmail.com
    imap_port = fields.IntField(default=993)
    imap_username = fields.CharField(max_length=255)
    imap_password = EncryptedTextField()  # Encrypted at rest

    # Status
    is_active = fields.BooleanField(default=True)
    last_error = fields.TextField(null=True)
    created_at = fields.DatetimeField(auto_now_add=True)
    updated_at = fields.DatetimeField(auto_now=True)

    class Meta:
        table = "email_accounts"


class EmailSyncStatus(Model):
    """
    Tracks sync progress per account.
    Prevents re-downloading messages.
    """
    id = fields.UUIDField(primary_key=True)
    account = fields.ForeignKeyField('models.EmailAccount', related_name='sync_status', unique=True)
    last_sync_date = fields.DatetimeField(null=True)
    last_message_uid = fields.IntField(default=0)  # IMAP UID of last fetched message
    message_count = fields.IntField(default=0)

    # Error tracking
    consecutive_failures = fields.IntField(default=0)
    last_failure_date = fields.DatetimeField(null=True)
    updated_at = fields.DatetimeField(auto_now=True)

    class Meta:
        table = "email_sync_status"


class Email(Model):
    """
    Email message metadata and content.
    30-day retention enforced at application level.
    """
    id = fields.UUIDField(primary_key=True)
    account = fields.ForeignKeyField('models.EmailAccount', related_name='emails')

    # Email metadata
    message_id = fields.CharField(max_length=255, unique=True, index=True)  # RFC822 Message-ID
    subject = fields.CharField(max_length=500)
    from_address = fields.CharField(max_length=255)
    to_address = fields.TextField()  # May have multiple recipients
    date = fields.DatetimeField()

    # Body content
    body_text = fields.TextField(null=True)  # Plain text version
    body_html = fields.TextField(null=True)  # HTML version

    # Vector store reference
    chromadb_doc_id = fields.CharField(max_length=255, null=True)  # Links to ChromaDB

    # Retention
    created_at = fields.DatetimeField(auto_now_add=True)
    expires_at = fields.DatetimeField()  # Auto-set to created_at + 30 days

    class Meta:
        table = "emails"

    async def save(self, *args, **kwargs):
        """Auto-set expiration date"""
        if not self.expires_at:
            self.expires_at = datetime.now() + timedelta(days=30)
        await super().save(*args, **kwargs)
```
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| imaplib (sync) | aioimaplib (async) | 2016 | Non-blocking IMAP; Quart-compatible; better performance |
| Message.walk() | msg.get_body() | Python 3.6+ (Dec 2016) | Simplified API; handles multipart correctly; policy-aware |
| PyCrypto | cryptography | 2016 | Actively maintained; audited; proper key rotation |
| cron system jobs | Quart-Tasks | 2020+ | Application-integrated; async-native; no external cron |
| email.message.Message (legacy compat32 API) | email.message.EmailMessage | Python 3.6+ | Better API; policy system; modern email handling |
**Deprecated/outdated:**
- **imaplib2**: Unmaintained since 2015; use aioimaplib
- **PyCrypto**: Abandoned 2013; use cryptography
- **Message.get_payload()**: Use get_content() for proper decoding
- **email.parser.Parser**: Use BytesParser with policy for modern parsing
## Open Questions
Things that couldn't be fully resolved:
1. **IMAP OAUTH2 Support**
- What we know: aioimaplib supports OAUTH2 authentication
- What's unclear: Gmail requires OAUTH2 for new accounts (may need app registration)
- Recommendation: Start with password auth; add OAUTH2 in Phase 2 if needed
2. **Attachment Handling**
- What we know: Email attachments excluded from Phase 1 scope
- What's unclear: Should attachment metadata be stored (filename, size)?
- Recommendation: Store metadata (attachment_count field), skip content for now
3. **Folder Selection Strategy**
- What we know: Most providers have INBOX, Sent, Drafts, Trash
- What's unclear: Should we sync only INBOX or multiple folders?
- Recommendation: Start with INBOX only; make folder list configurable
4. **Embedding Model for Emails**
- What we know: Existing codebase uses text-embedding-3-small (OpenAI)
- What's unclear: Do email embeddings need different model than documents?
- Recommendation: Reuse existing embedding model for consistency
5. **Concurrent Account Syncing**
- What we know: Multiple accounts should sync independently
- What's unclear: Should syncs run in parallel or sequentially?
- Recommendation: Sequential for Phase 1; parallel with asyncio.gather in later phase
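
For the later parallel phase mentioned in question 5, one possible shape is `asyncio.gather` behind a semaphore. This is a sketch under assumptions: `sync_one` stands for a per-account coroutine such as the sync flow in Example 1, and the default of three parallel syncs is arbitrary.

```python
import asyncio


async def sync_all(accounts, sync_one, max_parallel: int = 3):
    """Run sync_one(account) for every account, at most max_parallel at once."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def guarded(account):
        async with semaphore:
            return await sync_one(account)

    # return_exceptions=True keeps one failing account from cancelling the rest
    results = await asyncio.gather(
        *(guarded(account) for account in accounts),
        return_exceptions=True,
    )
    # Pair each account with its result or the exception it raised
    return list(zip(accounts, results))
```

Because each account has its own provider-side connection quota, the semaphore here mainly bounds local resource use; per-account limits still need the safeguards from Pitfall 1.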
## Sources
### Primary (HIGH confidence)
- aioimaplib v2.0.1 - https://github.com/bamthomas/aioimaplib (Jan 2025 release)
- aioimaplib PyPI - https://pypi.org/project/aioimaplib/ (v2.0.1, Python 3.9-3.12)
- Python email.parser docs - https://docs.python.org/3/library/email.parser.html (Feb 2026)
- Python email.message docs - https://docs.python.org/3/library/email.message.html (Feb 2026)
- cryptography Fernet docs - https://cryptography.io/en/latest/fernet/ (v47.0.0.dev1)
- Tortoise ORM fields docs - https://tortoise.github.io/fields.html (v0.25.4)
- Quart-Tasks GitHub - https://github.com/pgjones/quart-tasks (official extension)
### Secondary (MEDIUM confidence)
- IMAP commands reference - https://www.atmail.com/blog/imap-commands/ (tutorial)
- RFC 3501 IMAP4rev1 - https://www.rfc-editor.org/rfc/rfc3501 (official spec)
- RFC 6154 Special-Use Mailboxes - https://www.rfc-editor.org/rfc/rfc6154.html (official spec)
- html2text PyPI - https://pypi.org/project/html2text/ (v2025.4.15)
- Job Scheduling with APScheduler - https://betterstack.com/community/guides/scaling-python/apscheduler-scheduled-tasks/ (2024 guide)
### Secondary (MEDIUM confidence - verified with official docs)
- Email parsing guide - https://www.nylas.com/blog/email-parsing-with-python-a-comprehensive-guide/ (verified against Python docs)
- Fernet best practices - Multiple sources cross-referenced with official cryptography docs
- IMAP security best practices - https://www.getmailbird.com/sudden-spike-imap-sync-failures-email-providers/ (2026 article, current issues)
### Tertiary (LOW confidence - WebSearch only)
- mail-parser library - https://github.com/SpamScope/mail-parser (alternative, not fully evaluated)
- flanker library - https://github.com/mailgun/flanker (alternative, not fully evaluated)
## Metadata
**Confidence breakdown:**
- Standard stack: **HIGH** - All libraries verified via official docs/PyPI; current versions confirmed; Python 3.9+ compatibility validated
- Architecture: **HIGH** - Patterns demonstrated in existing codebase (Tortoise models, Quart blueprints, ChromaDB collections)
- Pitfalls: **MEDIUM** - Based on documentation warnings + community reports; some edge cases may exist
- OAUTH2 implementation: **LOW** - Not fully researched for this phase
**Research date:** 2026-02-07
**Valid until:** 2026-04-07 (60 days - stable technologies with slow release cycles)
**Notes:**
- aioimaplib actively maintained (Jan 2025 release)
- Python 3.14 stdlib recent (Feb 2026 docs)
- cryptography library rapid releases (security-focused)
- Recommend re-validating aioimaplib/cryptography versions at implementation time