Case study
VNGRS — VLM training-data pipeline
[01] batu@batu0:~/case-studies/vngrs
VNGRS needed a multi-million-image training dataset for a Vision-Language Model, drawn from many heterogeneous sources — HuggingFace, Kaggle, GitHub releases, direct image URLs, client-provided image folders, and scraped PDFs from public archives. The data had to be normalized, categorized, sampled to target distributions, and concatenated into a single reproducible corpus — with the pipeline itself runnable on AWS EC2 and resumable across days of execution.
I was the contract data engineer on this effort, Dec 2025 – Jan 2026: ~32 active days and ~295 commits.
The problem
A VLM training run needs millions of images with clean category / subcategory / provenance metadata. The upstream data lives in incompatible shapes: some of it is already in HuggingFace Arrow datasets, some is in Kaggle image archives, some is a GitHub release with loose PNGs, some is URLs in a CSV, some is PDFs from scraped library archives. Building toward a 10M-sample target under per-subcategory quotas — and keeping the whole thing resumable on EC2 — is where the engineering lives.
What I built
A config-driven, resumable Python pipeline in ~/docs/vngrs:
Multi-source downloaders
Pluggable BaseDownloader with concrete HFDownloader, HFImageURLsDownloader,
KaggleDownloader, GitHubDownloader, and a parallel direct-image URL downloader. Per-source
cooldowns and rate-limit handling; cache-first-then-force-redownload strategy for HuggingFace;
automatic multi-config dataset handling.
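To make the plugin structure concrete, here is a minimal Python sketch of such a downloader hierarchy. The class names BaseDownloader and HFDownloader match the write-up, but the methods, arguments, and the cache-then-force-redownload fallback below are illustrative assumptions, not the project's actual code.

    # Illustrative sketch only: methods, arguments, and caching logic are assumed.
    import time
    from abc import ABC, abstractmethod
    from pathlib import Path

    from datasets import load_dataset


    class BaseDownloader(ABC):
        """Shared contract: fetch one source into a local directory, politely."""

        def __init__(self, out_dir: Path, cooldown_s: float = 1.0):
            self.out_dir = out_dir
            self.cooldown_s = cooldown_s  # per-source cooldown between fetches

        @abstractmethod
        def download(self, source_id: str) -> Path:
            """Fetch one source and return the local path it landed in."""

        def _throttle(self) -> None:
            # naive rate-limit handling: wait out the cooldown before each fetch
            time.sleep(self.cooldown_s)


    class HFDownloader(BaseDownloader):
        def download(self, source_id: str) -> Path:
            self._throttle()
            try:
                ds = load_dataset(source_id)  # cache-first: reuse the local HF cache if present
            except Exception:
                # fall back to a clean redownload when the cached copy is unusable
                ds = load_dataset(source_id, download_mode="force_redownload")
            target = self.out_dir / source_id.replace("/", "__")
            ds.save_to_disk(str(target))
            return target

Keeping the throttle and output-directory logic in the base class lets each concrete downloader (Kaggle, GitHub, raw URLs) stay a thin wrapper around its source's client library.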
Format-normalization layer
Every source type is ingested into the same HuggingFace Dataset / DatasetDict shape, with
category / subcategory / source provenance preserved in a per-dataset metadata.json. Dedicated
ingestion paths for HF datasets, Kaggle image archives, PDF→image outputs, client-provided
folders (VNGRSHFImageDatasetBuilder), and raw-image-URL dumps.
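As a rough illustration of that normalized shape, the Python sketch below builds a Dataset with image / category / subcategory / source columns from a folder of images and writes a metadata.json next to it; the column names, metadata fields, and helper name are assumptions, not the project's schema.

    # Sketch of the normalization target (assumed column and file names).
    import json
    from pathlib import Path

    from datasets import Dataset, Image


    def normalize_image_folder(folder: Path, category: str, subcategory: str,
                               source: str, out_dir: Path) -> Dataset:
        files = sorted(str(p) for p in folder.glob("*.png"))
        ds = Dataset.from_dict({
            "image": files,
            "category": [category] * len(files),
            "subcategory": [subcategory] * len(files),
            "source": [source] * len(files),
        }).cast_column("image", Image())  # decode the image paths as an HF Image feature

        out_dir.mkdir(parents=True, exist_ok=True)
        ds.save_to_disk(str(out_dir))
        (out_dir / "metadata.json").write_text(json.dumps({
            "category": category,
            "subcategory": subcategory,
            "source": source,
            "num_rows": ds.num_rows,
        }, indent=2))
        return ds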
Scrapy spider + concurrent PDF→image
A Scrapy spider (IstanbulUniNewspapersSpider) to discover and download historical newspaper
PDFs from the Istanbul University library newspaper archive, followed by a concurrent
pdf2image + Poppler stage at 300 DPI, multiprocessed across all CPUs with per-file timing
statistics.
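A Python sketch of what that concurrent conversion stage can look like; the 300 DPI setting and per-file timing come from the write-up, while the function names and output layout are illustrative (Poppler must be installed for pdf2image to work).

    # Sketch of the concurrent PDF→image stage (assumed names and layout).
    import time
    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path

    from pdf2image import convert_from_path


    def pdf_to_pngs(pdf_path: str, out_root: str, dpi: int = 300) -> tuple[str, float, int]:
        """Convert one PDF to per-page PNGs; return (path, seconds, page count)."""
        start = time.perf_counter()
        out_dir = Path(out_root) / Path(pdf_path).stem
        out_dir.mkdir(parents=True, exist_ok=True)
        pages = convert_from_path(pdf_path, dpi=dpi)  # uses Poppler under the hood
        for i, page in enumerate(pages):
            page.save(out_dir / f"page_{i:04d}.png")
        return pdf_path, time.perf_counter() - start, len(pages)


    def convert_all(pdfs: list[str], out_root: str) -> None:
        # ProcessPoolExecutor defaults to one worker per CPU
        with ProcessPoolExecutor() as pool:
            for pdf, secs, n_pages in pool.map(pdf_to_pngs, pdfs, [out_root] * len(pdfs)):
                print(f"{pdf}: {n_pages} pages in {secs:.1f}s")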
Subcategory allocator + concatenator
Deterministic, seed-controlled, index-based sampling — explicitly replacing
datasets.shuffle() after reproducibility issues — that distributes a target sample count
(10,000,000) across datasets of different sizes under per-subcategory quota constraints. An audit
layer cross-checks the concatenated corpus against the source registry to confirm every declared
source actually landed in the output.
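The index-based sampling idea can be sketched as follows, assuming a seeded NumPy generator and Dataset.select(); the quota handling and names are simplified stand-ins for the real allocator.

    # Sketch: deterministic sampling by explicit indices instead of datasets.shuffle().
    import numpy as np
    from datasets import Dataset, concatenate_datasets


    def sample_deterministic(ds: Dataset, quota: int, seed: int) -> Dataset:
        rng = np.random.default_rng(seed)  # fixed seed -> reproducible indices
        n = min(quota, ds.num_rows)
        idx = rng.choice(ds.num_rows, size=n, replace=False)
        return ds.select(sorted(int(i) for i in idx))


    def build_corpus(datasets_by_subcat: dict[str, Dataset],
                     quotas: dict[str, int], seed: int = 42) -> Dataset:
        # each subcategory gets its own quota; sorting fixes the iteration order
        parts = [sample_deterministic(ds, quotas[sub], seed)
                 for sub, ds in sorted(datasets_by_subcat.items())]
        return concatenate_datasets(parts)

Selecting by explicit indices keeps the sampled subset a pure function of (seed, dataset size, quota), which is what makes a 10M-sample run repeatable.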
Operational hardening
- YAML config (config/config.yaml) for both local and AWS paths. StateManager JSON files for resumable runs; single_instance lock files to prevent double-runs (see the sketch after this list).
- Named-logger tree with per-subsystem log files, tqdm redirected to logging for long jobs. dry_run and test_mode flags.
- Deployed onto an EC2 instance with externally-mounted EBS volumes, SSHFS from the laptop for inspection, and S3 backups of intermediate state (s3://cache-.../vlm-data/…).
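A minimal Python sketch, with assumed names, of the two resumability pieces in the first bullet: a JSON-backed state manager and a single-instance lock file.

    # Sketch (assumed names): resumable state + single-instance lock.
    import json
    import os
    import sys
    from pathlib import Path


    class StateManager:
        """Persist pipeline progress to JSON so an interrupted run can resume."""

        def __init__(self, path: Path):
            self.path = path
            self.state = json.loads(path.read_text()) if path.exists() else {}

        def done(self, step: str) -> bool:
            return self.state.get(step) == "done"

        def mark_done(self, step: str) -> None:
            self.state[step] = "done"
            self.path.write_text(json.dumps(self.state, indent=2))


    def acquire_single_instance_lock(lock_path: Path) -> None:
        """Refuse to start if another run already holds the lock file."""
        if lock_path.exists():
            sys.exit(f"another run holds {lock_path}; refusing to start")
        lock_path.write_text(str(os.getpid()))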
Stack
Python · HuggingFace datasets (Arrow, shards, concat, save_to_disk) · Scrapy · kagglehub ·
pdf2image + Poppler · Pandas · ProcessPoolExecutor · PyYAML · python-dotenv · AWS EC2 ·
AWS S3 · SSHFS · Linux · Git.
Scale & outcomes
- Target corpus: 10,000,000 image samples across the project’s category tree. Git history shows successive runs at 1k → 10k → 100k → 1M → 10M.
- Arrow shard size tuned from 1 GB → 2 GB → 10 GB as the concatenated output grew.
- Data sources integrated: HuggingFace datasets, Kaggle datasets, GitHub releases, direct image-URL dumps, scraped newspaper PDFs, and client-provided image folders — all reduced to a single HF-Arrow-on-disk layout.
- Resumable across restarts, deterministic sampling (reproducible with a fixed seed), auditable (source-vs-output cross-check), single-instance-safe.