The Architecture Behind an AI Video Processing Pipeline

Source: DEV Community
Building a video processing service that handles everything from YouTube download to AI-scored, captioned, face-tracked vertical clips involves a lot of moving parts. This post is a straight-up architecture breakdown: the components, how they talk to each other, and the design decisions that actually matter at scale. This is the architecture running ClipSpeedAI.

System Overview

At the highest level, the pipeline is:

User submits YouTube URL
→ Download job queued
→ Video downloaded (yt-dlp)
→ Audio extracted (FFmpeg)
→ Transcription (Whisper API)
→ Clip scoring (GPT-4o)
→ Face detection (MediaPipe, Python child process)
→ Clip extraction + crop (FFmpeg)
→ Caption generation + burn (Whisper segments → FFmpeg drawtext)
→ Output clips uploaded to storage
→ User notified via WebSocket

Each stage is a discrete job in a Bull queue. Stages are not chained synchronously; they produce events that trigger the next stage. This means any stage can fail and retry independently.

Component Map

┌────
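The event-chained stage pattern described above can be modeled in-process. Bull itself needs a Redis connection, so this dependency-free TypeScript sketch stands in for it; the stage names and payload shapes here are illustrative assumptions, not ClipSpeedAI's actual identifiers:

```typescript
// Minimal in-process model of event-chained pipeline stages.
// Stage names and payload fields are illustrative placeholders.
import { EventEmitter } from "events";

type Job = { data: any; attemptsMade: number };

class Stage extends EventEmitter {
  constructor(
    public name: string,
    private handler: (job: Job) => Promise<any>,
    private maxAttempts = 3,
  ) {
    super();
  }

  // Run a job with independent retries: a failure here never
  // re-runs upstream stages, mirroring Bull's per-job retry.
  async add(data: any): Promise<void> {
    const job: Job = { data, attemptsMade: 0 };
    while (true) {
      try {
        const result = await this.handler(job);
        this.emit("completed", job, result); // triggers the next stage
        return;
      } catch (err) {
        job.attemptsMade++;
        if (job.attemptsMade >= this.maxAttempts) {
          this.emit("failed", job, err);
          return;
        }
      }
    }
  }
}

const download = new Stage("download", async (job) => {
  // The real pipeline shells out to yt-dlp here; stubbed for the sketch.
  return { videoPath: "/tmp/clip.mp4", url: job.data.url };
});

const transcribe = new Stage("transcribe", async (job) => {
  // The real pipeline calls the Whisper API here; stubbed for the sketch.
  return { transcript: `segments for ${job.data.videoPath}` };
});

// Completion of one stage enqueues the next via an event listener,
// not an inline call, so the stages stay decoupled.
download.on("completed", (_job, result) => transcribe.add(result));
```

Bull gives the same shape with `queue.process(...)` handlers and its `completed` event; the essential property is that a transcription failure retries only transcription, never the download.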
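For the caption-burn stage, Whisper segments have to become an FFmpeg drawtext filter string. A minimal sketch, assuming Whisper's verbose JSON segment shape (`start`, `end`, `text`); the font size, color, and positioning values are placeholder assumptions:

```typescript
// Build an FFmpeg drawtext filter chain from Whisper-style segments.
// Styling values are illustrative, not the service's actual settings.
type Segment = { start: number; end: number; text: string };

// drawtext treats ':' as an option separator and '\'' as a quote,
// so both must be escaped inside the text value.
function escapeDrawtext(s: string): string {
  return s.replace(/\\/g, "\\\\").replace(/:/g, "\\:").replace(/'/g, "\\'");
}

function buildCaptionFilter(segments: Segment[]): string {
  return segments
    .map(
      (seg) =>
        `drawtext=text='${escapeDrawtext(seg.text.trim())}'` +
        `:fontsize=48:fontcolor=white:x=(w-text_w)/2:y=h-200` +
        `:enable='between(t,${seg.start},${seg.end})'`,
    )
    .join(",");
}
```

The resulting string is handed to FFmpeg as a `-vf` filter; `enable='between(t,start,end)'` shows each caption only during its own segment, which is how per-segment timing survives the burn.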