Config-driven, resumable Python pipeline that ingests data from Hugging Face, Kaggle, GitHub, direct image URLs, and scraped PDFs into a single Hugging Face Arrow dataset. A deterministic sampler allocates toward a 10M-sample target; runs are coordinated by a state machine and single-instance locks and execute on AWS EC2/S3. ~295 commits.
Python · Hugging Face Datasets · Scrapy · pdf2image · AWS
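
The entry does not say how the deterministic sampler allocates toward its target, so the following is only a minimal sketch of one plausible approach: weighted quotas per source, capped by what each source can supply, with any shortfall redistributed to the sources that still have room. The function name `allocate_quotas`, the source names, the integer weights, and the availability counts are all hypothetical and not taken from the project.

```python
from typing import Dict


def allocate_quotas(
    target: int,
    available: Dict[str, int],
    weights: Dict[str, int],
) -> Dict[str, int]:
    """Split `target` samples across sources by integer weight.

    No source receives more than it has available. Integer arithmetic and
    sorted iteration keep the result identical on every run and machine.
    """
    quotas = {name: 0 for name in available}

    def room(name: str) -> int:
        return available[name] - quotas[name]

    # Only sources with a positive weight can receive samples.
    eligible = [n for n in available if weights.get(n, 0) > 0]
    remaining = min(target, sum(available[n] for n in eligible))

    while remaining > 0:
        active = sorted(n for n in eligible if room(n) > 0)
        total_weight = sum(weights[n] for n in active)
        granted = 0
        for name in active:
            # Floor of the weighted share, capped by the source's spare capacity.
            take = min(remaining * weights[name] // total_weight, room(name))
            quotas[name] += take
            granted += take
        remaining -= granted
        if granted == 0:
            # Every floored share was zero: hand out the last few samples
            # one at a time, heaviest weight first, ties broken by name.
            order = sorted(active, key=lambda n: (-weights[n], n))
            for name in order[:remaining]:
                quotas[name] += 1
            remaining = 0
    return quotas


# Hypothetical sources and counts, for illustration only.
example = allocate_quotas(
    target=10_000_000,
    available={"hf": 8_000_000, "kaggle": 1_000_000, "pdf": 5_000_000},
    weights={"hf": 2, "kaggle": 1, "pdf": 1},
)
print(example)  # {'hf': 6000000, 'kaggle': 1000000, 'pdf': 3000000}
```

In this sketch the small `kaggle` source is exhausted in the first pass, and its unmet share is split between the other two sources in proportion to their weights, so the quotas still sum to the full target.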