
Ryan Chen 2305dfddb1 Add async video downloads with yt-dlp and Celery

- Added yt-dlp, celery, and redis dependencies to pyproject.toml
- Extended VideoEntry model with download tracking fields:
  - download_status (enum: pending, downloading, completed, failed)
  - download_path, download_started_at, download_completed_at
  - download_error, file_size
- Created celery_app.py with Redis broker configuration
- Created download_service.py with async download tasks:
  - download_video() task downloads as MP4 format
  - Configured yt-dlp for best MP4 quality with fallback
  - Automatic retries on failure (max 3 attempts)
  - Progress tracking and database updates
- Added Flask API endpoints in main.py:
  - POST /api/download/<video_id> to trigger download
  - GET /api/download/status/<video_id> to check status
  - POST /api/download/batch for bulk downloads
- Generated and applied Alembic migration for new fields
- Created downloads/ directory for video storage
- Updated .gitignore to exclude downloads/ directory
- Updated CLAUDE.md with comprehensive documentation:
  - Redis and Celery setup instructions
  - Download workflow and architecture
  - yt-dlp configuration details
  - New API endpoint examples

Videos are downloaded as MP4 files using Celery workers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-26 14:04:30 -05:00


CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

yottob is a Flask-based web application for processing YouTube RSS feeds with SQLAlchemy ORM persistence and async video downloads. The project provides both a REST API and CLI interface for fetching and parsing YouTube channel feeds, with filtering logic to exclude YouTube Shorts. All fetched feeds are automatically saved to a SQLite database for historical tracking. Videos can be downloaded asynchronously as MP4 files using Celery workers and yt-dlp.

Development Setup

This project uses uv for dependency management.

Install dependencies:

uv sync

Activate virtual environment:

source .venv/bin/activate  # On macOS/Linux

Initialize/update database:

# Run migrations to create or update database schema
source .venv/bin/activate && alembic upgrade head

Start Redis (required for Celery):

# macOS with Homebrew
brew services start redis

# Linux
sudo systemctl start redis

# Docker
docker run -d -p 6379:6379 redis:alpine

# Verify Redis is running
redis-cli ping  # Should return "PONG"

Start Celery worker (required for video downloads):

source .venv/bin/activate && celery -A celery_app worker --loglevel=info

Running the Application

Run the CLI feed parser:

python main.py

This runs main(), which fetches and parses a single YouTube channel RSS feed as a quick test.

Run the Flask web application:

flask --app main run

The web server exposes:

/ - Main page (renders index.html)
/api/feed - API endpoint for fetching feeds and saving to database
/api/channels - List all tracked channels
/api/history/<channel_id> - Get video history for a specific channel
/api/download/<video_id> - Trigger video download (POST)
/api/download/status/<video_id> - Check download status (GET)
/api/download/batch - Batch download multiple videos (POST)

API Usage Examples:

# Fetch default channel feed (automatically saves to DB)
curl http://localhost:5000/api/feed

# Fetch specific channel with options
curl "http://localhost:5000/api/feed?channel_id=CHANNEL_ID&filter_shorts=false&save=true"

# List all tracked channels
curl http://localhost:5000/api/channels

# Get video history for a channel (limit 20 videos)
curl "http://localhost:5000/api/history/CHANNEL_ID?limit=20"

# Trigger download for a specific video
curl -X POST http://localhost:5000/api/download/123

# Check download status
curl http://localhost:5000/api/download/status/123

# Batch download all pending videos for a channel
curl -X POST "http://localhost:5000/api/download/batch?channel_id=CHANNEL_ID&status=pending"

# Batch download specific video IDs
curl -X POST http://localhost:5000/api/download/batch \
  -H "Content-Type: application/json" \
  -d '{"video_ids": [1, 2, 3, 4, 5]}'
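The same download requests can be built from Python. The following is a minimal sketch using only the standard library; the helper names build_download_request and build_batch_request are hypothetical, and BASE_URL assumes the Flask development server on its default port. The functions only construct the requests; they do not send them, and the response format is not described in this file.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:5000"  # assumption: Flask dev server default address


def build_download_request(video_id: int) -> Request:
    """Build (but do not send) the POST that queues a single download."""
    return Request(f"{BASE_URL}/api/download/{video_id}", method="POST")


def build_batch_request(video_ids: list[int]) -> Request:
    """Build (but do not send) the POST that queues several downloads."""
    body = json.dumps({"video_ids": video_ids}).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/download/batch",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the Flask app, Redis, and a Celery worker running, either request can be sent with urllib.request.urlopen(request).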

Architecture

The codebase is split into layers, each with a single responsibility: database, async task queue, core logic, and web server.

Database Layer

models.py - SQLAlchemy ORM models

Base: Declarative base for all models
DownloadStatus: Enum for download states (pending, downloading, completed, failed)
Channel: Stores YouTube channel metadata (channel_id, title, link, last_fetched)
VideoEntry: Stores individual video entries with foreign key to Channel, plus download tracking fields:
- download_status, download_path, download_started_at, download_completed_at, download_error, file_size
Relationships: One Channel has many VideoEntry records

database.py - Database configuration and session management

DATABASE_URL: SQLite database location (yottob.db)
engine: SQLAlchemy engine instance
init_db(): Creates all tables
get_db_session(): Context manager for database sessions
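The commit-or-rollback pattern behind a session context manager can be sketched as follows. This is not the project's get_db_session(): it uses the standard library's sqlite3 connection as a stand-in for a SQLAlchemy session so the example runs without extra dependencies, and the name sqlite_session is hypothetical.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def sqlite_session(path: str = "yottob.db"):
    """Yield a connection; commit on success, roll back on error, always close."""
    conn = sqlite3.connect(path)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()
```

Callers wrap each unit of work in a with block, so a raised exception never leaves a half-written transaction behind.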

Async Task Queue Layer

celery_app.py - Celery configuration

Celery instance configured with Redis broker
Task serialization and worker configuration
1-hour task timeout with automatic retries
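A configuration along these lines might look like the sketch below. The keys are Celery's standard lowercase setting names, but the values are assumptions: the Redis database index and the JSON serializer are not stated in this file, and the actual contents of celery_app.py may differ.

```python
# Hypothetical settings dictionary; apply with app.conf.update(CELERY_CONFIG)
# on a Celery application instance.
CELERY_CONFIG = {
    "broker_url": "redis://localhost:6379/0",  # host/port from Requirements; db 0 is an assumption
    "task_serializer": "json",                 # assumption: serializer is not named in this file
    "accept_content": ["json"],                # only accept messages in the serializer above
    "task_time_limit": 60 * 60,                # hard limit in seconds (the 1-hour timeout)
}
```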

download_service.py - Video download tasks

download_video(video_id): Celery task to download a single video as MP4
- Uses yt-dlp with MP4 format preference
- Updates database with download progress and status
- Automatic retry on failure (max 3 attempts)
download_videos_batch(video_ids): Queue multiple downloads
Downloads saved to downloads/ directory

Core Logic Layer

feed_parser.py - Reusable YouTube feed parsing module

YouTubeFeedParser: Main parser class that encapsulates channel-specific logic
FeedEntry: In-memory data model for feed entries
fetch_feed(): Fetches and parses RSS feeds
save_to_db(): Persists feed data to database with upsert logic
Independent of Flask - can be imported and used in any Python context

Web Server Layer

main.py - Flask application and routes

app: Flask application instance (main.py:10)
Database initialization on startup (main.py:16)
index(): Homepage route handler (main.py:21)
get_feed(): REST API endpoint (main.py:27) that fetches and saves to DB
get_channels(): Lists all tracked channels (main.py:60)
get_history(): Returns video history for a channel (main.py:87)
trigger_download(): Queue video download task (main.py:134)
get_download_status(): Check download status (main.py:163)
trigger_batch_download(): Queue multiple downloads (main.py:193)
main(): CLI entry point for testing (main.py:251)

Templates

templates/index.html - Frontend HTML (currently static placeholder)

Feed Parsing Implementation

The YouTubeFeedParser class in feed_parser.py:

Constructs YouTube RSS feed URLs from channel IDs
Uses feedparser to fetch and parse feeds
Validates HTTP 200 status before processing
Optionally filters out YouTube Shorts (any entry with "shorts" in URL)
Returns structured dictionary with feed metadata and entries

YouTube RSS Feed URL Format:

https://www.youtube.com/feeds/videos.xml?channel_id={CHANNEL_ID}
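The URL construction and the Shorts filter can be sketched in a few lines. This is an illustration, not the code in feed_parser.py: the function names are hypothetical, and entries are assumed to be dictionaries with a "link" key.

```python
FEED_URL_TEMPLATE = "https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}"


def build_feed_url(channel_id: str) -> str:
    """Return the RSS feed URL for a YouTube channel ID."""
    return FEED_URL_TEMPLATE.format(channel_id=channel_id)


def filter_shorts(entries: list[dict]) -> list[dict]:
    """Drop every entry whose link contains the substring 'shorts'."""
    return [entry for entry in entries if "shorts" not in entry["link"]]
```

Because the filter is a plain substring match, any URL containing "shorts" is excluded.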

Database Migrations

This project uses Alembic for database schema migrations.

Create a new migration after model changes:

source .venv/bin/activate && alembic revision --autogenerate -m "Description of changes"

Apply migrations:

source .venv/bin/activate && alembic upgrade head

View migration history:

source .venv/bin/activate && alembic history

Rollback to previous version:

source .venv/bin/activate && alembic downgrade -1

Migration files location: alembic/versions/

Important notes:

Always review auto-generated migrations before applying
The database is automatically initialized on Flask app startup via init_db()
Migration configuration is in alembic.ini and alembic/env.py
Models are imported in alembic/env.py for autogenerate support

Database Schema

channels table:

id: Primary key
channel_id: YouTube channel ID (unique, indexed)
title: Channel title
link: Channel URL
last_fetched: Timestamp of last feed fetch

video_entries table:

id: Primary key
channel_id: Foreign key to channels.id
title: Video title
link: Video URL (unique)
created_at: Timestamp when video was first recorded
download_status: Enum (pending, downloading, completed, failed)
download_path: Local file path to downloaded MP4
download_started_at: When download began
download_completed_at: When download finished
download_error: Error message if download failed
file_size: Size in bytes of downloaded file
Index: idx_channel_created on (channel_id, created_at) to speed up per-channel queries ordered by creation time
Index: idx_download_status on download_status for filtering
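For orientation, the schema above corresponds roughly to the following SQLite DDL, shown here through the standard library's sqlite3 module. The real tables are generated by SQLAlchemy and Alembic, so the column types, the 'pending' default, and the lowercase status strings are assumptions; create_schema and pending_videos are hypothetical helpers.

```python
import sqlite3

SCHEMA = """
CREATE TABLE channels (
    id INTEGER PRIMARY KEY,
    channel_id TEXT NOT NULL UNIQUE,
    title TEXT,
    link TEXT,
    last_fetched TIMESTAMP
);
CREATE TABLE video_entries (
    id INTEGER PRIMARY KEY,
    channel_id INTEGER NOT NULL REFERENCES channels(id),
    title TEXT,
    link TEXT UNIQUE,
    created_at TIMESTAMP,
    download_status TEXT DEFAULT 'pending',
    download_path TEXT,
    download_started_at TIMESTAMP,
    download_completed_at TIMESTAMP,
    download_error TEXT,
    file_size INTEGER
);
CREATE INDEX idx_channel_created ON video_entries (channel_id, created_at);
CREATE INDEX idx_download_status ON video_entries (download_status);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create both tables and the two named indexes."""
    conn.executescript(SCHEMA)


def pending_videos(conn: sqlite3.Connection, channel_pk: int) -> list[tuple]:
    """Return (id, title) of a channel's pending videos, newest first."""
    rows = conn.execute(
        "SELECT id, title FROM video_entries "
        "WHERE channel_id = ? AND download_status = 'pending' "
        "ORDER BY created_at DESC",
        (channel_pk,),
    )
    return rows.fetchall()
```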

Video Download System

The application uses Celery with Redis for asynchronous video downloads:

Download Workflow:

User triggers download via /api/download/<video_id> (POST)
VideoEntry status changes to "downloading"
Celery worker picks up task and uses yt-dlp to download as MP4
Progress updates written to database
On completion, status changes to "completed" with file path
On failure, status changes to "failed" with the error message; failed downloads are retried automatically (max 3 attempts)
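The status transitions and retry limit in the workflow above can be sketched in plain Python. This is a simplified stand-in, not the Celery task: a real worker reschedules the task through Celery's retry mechanism instead of looping, the enum's string values are assumed, run_download is a hypothetical name, and "max 3 attempts" is read here as three attempts in total.

```python
from enum import Enum


class DownloadStatus(Enum):
    PENDING = "pending"
    DOWNLOADING = "downloading"
    COMPLETED = "completed"
    FAILED = "failed"


MAX_ATTEMPTS = 3  # assumption: three attempts in total


def run_download(download, max_attempts: int = MAX_ATTEMPTS):
    """Call download() until it succeeds or the attempts run out.

    Returns (final status, file path or None, error message or None).
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return DownloadStatus.COMPLETED, download(), None
        except Exception as exc:  # a real task would catch narrower errors
            last_error = str(exc)
    return DownloadStatus.FAILED, None, last_error
```

The returned tuple maps onto the download_status, download_path, and download_error columns.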

yt-dlp Configuration:

Format: bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best
Output format: MP4 (converted if necessary using FFmpeg)
Output location: downloads/<video_id>_<title>.mp4
Progress hooks for real-time status updates
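These settings map onto a yt-dlp options dictionary roughly as sketched below. The option keys (format, merge_output_format, outtmpl, progress_hooks) are standard yt-dlp options, but build_ydl_opts and progress_hook are hypothetical names, and the hook prints progress where the real service would presumably update the database. Re-encoding a non-MP4 download would need an additional post-processor that is not shown.

```python
from pathlib import Path

DOWNLOAD_DIR = Path("downloads")
FORMAT_SELECTOR = "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best"


def progress_hook(d: dict) -> None:
    """yt-dlp calls each hook with a status dictionary; report coarse progress."""
    status = d.get("status")
    if status == "downloading":
        done, total = d.get("downloaded_bytes"), d.get("total_bytes")
        if done is not None and total:
            print(f"{100 * done / total:.1f}%")
    elif status == "finished":
        print(f"finished: {d.get('filename')}")


def build_ydl_opts(video_id: int) -> dict:
    """Build options for yt_dlp.YoutubeDL; video_id prefixes the output file name."""
    return {
        "format": FORMAT_SELECTOR,
        "merge_output_format": "mp4",  # merge separate video/audio streams into MP4
        "outtmpl": str(DOWNLOAD_DIR / f"{video_id}_%(title)s.%(ext)s"),
        "progress_hooks": [progress_hook],
    }
```

The dictionary is meant to be passed to yt_dlp.YoutubeDL(opts), whose download() method takes a list of video URLs.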

Requirements:

Redis server must be running (localhost:6379)
Celery worker must be running to process downloads
FFmpeg recommended for format conversion (yt-dlp will use it if available)

Dependencies

Flask 3.1.2+: Web framework
feedparser 6.0.12+: RSS/Atom feed parsing
SQLAlchemy 2.0.0+: ORM for database operations
Alembic 1.13.0+: Database migration tool
Celery 5.3.0+: Distributed task queue for async jobs
redis 5.0.0+: Python client for Redis, which serves as the Celery message broker
yt-dlp 2024.0.0+: YouTube video downloader
Python 3.14+: Required runtime version
