← home · /hire

Case study

VNGRS — VLM training-data pipeline

VNGRS (Turkish AI consultancy) · Dec 2025 – Jan 2026 · Python · HuggingFace datasets · Scrapy · pdf2image · AWS

[01] batu@batu0:~/case-studies/vngrs

VNGRS needed a multi-million-image training dataset for a Vision-Language Model, drawn from many heterogeneous sources — HuggingFace, Kaggle, GitHub releases, direct image URLs, client-provided image folders, and scraped PDFs from public archives. The data had to be normalized, categorized, sampled to target distributions, and concatenated into a single reproducible corpus — with the pipeline itself runnable on AWS EC2 and resumable across days of execution.

I contracted in as the data engineer for this effort: Dec 2025 – Jan 2026, ~32 active days, ~295 commits.

The problem

A VLM training run needs millions of images with clean category / subcategory / provenance metadata. The upstream data lives in incompatible shapes: some of it is already in HuggingFace Arrow datasets, some is in Kaggle image archives, some is a GitHub release with loose PNGs, some is URLs in a CSV, some is PDFs from scraped library archives. Building toward a 10M-sample target under per-subcategory quotas — and keeping the whole thing resumable on EC2 — is where the engineering lives.

What I built

A config-driven, resumable Python pipeline in ~/docs/vngrs:

Multi-source downloaders

Pluggable BaseDownloader with concrete HFDownloader, HFImageURLsDownloader, KaggleDownloader, GitHubDownloader, and a parallel direct-image URL downloader. Per-source cooldowns and rate-limit handling; cache-first-then-force-redownload strategy for HuggingFace; automatic multi-config dataset handling.
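As a sketch of that plug-in seam — the class names come from the project, but the method and attribute names below are illustrative, not its exact API:

```python
from abc import ABC, abstractmethod
import time


class BaseDownloader(ABC):
    """Common contract for all source downloaders (illustrative interface)."""

    #: seconds to wait between requests to this source
    cooldown: float = 0.0

    def __init__(self, target_dir: str):
        self.target_dir = target_dir
        self._last_request = 0.0

    def throttle(self) -> None:
        """Sleep just long enough to respect this source's cooldown."""
        wait = self.cooldown - (time.monotonic() - self._last_request)
        if wait > 0:
            time.sleep(wait)
        self._last_request = time.monotonic()

    @abstractmethod
    def download(self) -> str:
        """Fetch the source into target_dir; return the local path."""


class GitHubDownloader(BaseDownloader):
    cooldown = 1.0  # be polite to the releases API

    def download(self) -> str:
        self.throttle()
        # ... fetch release assets into self.target_dir ...
        return self.target_dir
```

The payoff of the ABC is that the orchestrator iterates over a registry of downloaders and calls the same `download()` on each, with per-source rate limiting living inside the subclass rather than in the caller.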

Format-normalization layer

Every source type is ingested into the same HuggingFace Dataset / DatasetDict shape, with category / subcategory / source provenance preserved in a per-dataset metadata.json. Dedicated ingestion paths for HF datasets, Kaggle image archives, PDF→image outputs, client-provided folders (VNGRSHFImageDatasetBuilder), and raw-image-URL dumps.
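The provenance side of this can be sketched as a small helper that writes the per-dataset sidecar; the field names below are an assumed schema, not the project's exact metadata.json layout:

```python
import json
from pathlib import Path


def write_dataset_metadata(dataset_dir, *, category, subcategory, source, num_samples):
    """Persist per-dataset provenance next to the saved Arrow data.

    Field names are illustrative; the point is that every normalized
    dataset carries its category / subcategory / source with it on disk.
    """
    meta = {
        "category": category,
        "subcategory": subcategory,
        "source": source,
        "num_samples": num_samples,
    }
    path = Path(dataset_dir) / "metadata.json"
    path.write_text(json.dumps(meta, indent=2))
    return path
```

Keeping provenance in a sidecar file rather than only in dataset columns means the allocator and audit stages can read it without loading the Arrow data itself.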

Scrapy spider + concurrent PDF→image

A Scrapy spider (IstanbulUniNewspapersSpider) to discover and download historical newspaper PDFs from the Istanbul University library newspaper archive, followed by a concurrent pdf2image + Poppler stage at 300 DPI, multiprocessed across all CPUs with per-file timing statistics.
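The fan-out stage might look roughly like this; `timed_convert`, `convert_all`, and the fork-context choice are illustrative, with the real per-file worker wrapping pdf2image.convert_from_path(path, dpi=300):

```python
import time
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat
from multiprocessing import get_context


def timed_convert(fn, path):
    """Run one conversion and record wall-clock time for per-file stats."""
    start = time.perf_counter()
    result = fn(path)
    return path, result, time.perf_counter() - start


def convert_all(paths, fn, workers=None):
    """Fan a per-file conversion function out across processes.

    `fn` must be a picklable top-level function; in the real pipeline it
    would wrap pdf2image.convert_from_path(path, dpi=300) (Poppler-backed).
    """
    ctx = get_context("fork")  # fork keeps workers from re-importing the script
    with ProcessPoolExecutor(max_workers=workers, mp_context=ctx) as pool:
        # map over (fn, path) pairs; results come back in input order
        return list(pool.map(timed_convert, repeat(fn), paths))


def fake_rasterize(path):
    """Stand-in worker; the real one writes 300-DPI page images next to the PDF."""
    return path + ".png"
```

Returning the elapsed time per file is what feeds the per-file timing statistics; slow PDFs show up immediately in the logs.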

Subcategory allocator + concatenator

Deterministic, seed-controlled, index-based sampling — explicitly replacing datasets.shuffle() after it proved non-reproducible — that distributes the 10,000,000-sample target across datasets of different sizes under per-subcategory quota constraints. An audit layer cross-checks the concatenated corpus against the source registry to confirm every declared source actually landed in the output.
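A minimal sketch of the idea, assuming a proportional-fill allocator and plain random.Random index sampling (the real allocator's scheme may differ):

```python
import random


def sample_indices(dataset_len, n, seed):
    """Deterministically pick n distinct row indices.

    Feeding these to Dataset.select() stands in for datasets.shuffle():
    same seed, same rows, every run, on every machine.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(dataset_len), min(n, dataset_len)))


def allocate(target, caps):
    """Split `target` samples across datasets proportionally to each dataset's
    usable cap = min(dataset size, subcategory quota remaining).

    Illustrative scheme, not the project's exact allocator.
    """
    total = sum(caps.values())
    scale = min(1.0, target / total)
    return {name: int(cap * scale) for name, cap in caps.items()}
```

Because the sampled indices are a pure function of (length, n, seed), the audit layer can re-derive exactly which rows should be in the output and compare against the registry.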

Operational hardening

  • YAML config (config/config.yaml) for both local and AWS paths.
  • StateManager JSON files for resumable runs.
  • single_instance lock files to prevent double-runs.
  • Named-logger tree with per-subsystem log files; tqdm output redirected to logging for long jobs.
  • dry_run and test_mode flags.
  • Deployed onto an EC2 instance with externally-mounted EBS volumes, SSHFS from the laptop for inspection, and S3 backups of intermediate state (s3://cache-.../vlm-data/…).
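The resumability and locking pieces above can be sketched like this (class, method, and field names are illustrative, not the project's exact ones):

```python
import json
import os
import sys
from pathlib import Path


class StateManager:
    """Checkpoint pipeline progress to a JSON file so a killed run can resume."""

    def __init__(self, path):
        self.path = Path(path)
        self.state = json.loads(self.path.read_text()) if self.path.exists() else {}

    def is_done(self, step):
        return self.state.get(step) == "done"

    def mark_done(self, step):
        self.state[step] = "done"
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.state, indent=2))
        tmp.replace(self.path)  # atomic rename: no torn state file on a crash


def acquire_single_instance(lock_path):
    """Refuse to start if another run already holds the lock.

    O_CREAT | O_EXCL makes creation atomic: exactly one process wins.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        sys.exit(f"another instance is running (lock: {lock_path})")
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
```

With this shape, each pipeline stage checks `is_done()` before doing work, so a multi-day EC2 run restarted after a crash skips straight to where it left off.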

Stack

Python · HuggingFace datasets (Arrow, shards, concat, save_to_disk) · Scrapy · kagglehub · pdf2image + Poppler · Pandas · ProcessPoolExecutor · PyYAML · python-dotenv · AWS EC2 · AWS S3 · SSHFS · Linux · Git.

Scale & outcomes

  • Target corpus: 10,000,000 image samples across the project’s category tree. Git history shows successive runs at 1k → 10k → 100k → 1M → 10M.
  • Arrow shard size tuned from 1 GB → 2 GB → 10 GB as the concatenated output grew.
  • Data sources integrated: HuggingFace datasets, Kaggle datasets, GitHub releases, direct image-URL dumps, scraped newspaper PDFs, and client-provided image folders — all reduced to a single HF-Arrow-on-disk layout.
  • Resumable across restarts, deterministic sampling (reproducible with a fixed seed), auditable (source-vs-output cross-check), single-instance-safe.

← back to work · see /hire for similar engagements