Case study
First Turkish VLM — training-data pipeline
The client needed a multi-million-image training dataset for the first Turkish Vision-Language Model, drawn from many heterogeneous sources. The data had to be normalized, categorized, sampled to target distributions, and concatenated into a single reproducible corpus. The pipeline also had to run on AWS EC2 and resume cleanly across multi-day runs.
I joined as the contract data engineer for this effort from Dec 2025 to Jan 2026, with ~32 active days and ~295 commits.
The problem
A VLM training run needs millions of images with clean category / subcategory / provenance metadata. The upstream data lives in incompatible shapes across multiple source types — hosted datasets, image archives, loose files, URL lists, and scraped PDFs. Building toward a 10M-sample target under per-subcategory quotas, and keeping the whole thing resumable on EC2, is where the engineering lives.
What I built
A config-driven, resumable Python pipeline with pluggable source adapters.
Multi-source ingestion
Pluggable downloader architecture covering hosted ML datasets, image archives, direct URL lists, and scraped document corpora. Per-source rate-limit handling, cache-first strategies, and automatic format normalization into a unified on-disk layout with provenance metadata preserved per dataset.
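To make the adapter idea concrete, here is a minimal sketch of a registry that maps a source kind to an adapter class, with every adapter emitting the same normalized record. The names `Record`, `ADAPTERS`, `register`, `UrlListAdapter`, and `ingest`, and the config keys, are illustrative assumptions rather than the project's actual code, and the download step itself is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List


@dataclass(frozen=True)
class Record:
    """One normalized sample plus the provenance fields kept per dataset."""
    source: str
    category: str
    subcategory: str
    relative_path: str


# Hypothetical registry mapping a source kind to its adapter class.
ADAPTERS: Dict[str, Callable[[], object]] = {}


def register(kind: str):
    """Class decorator that makes an adapter discoverable by its config kind."""
    def wrap(cls):
        ADAPTERS[kind] = cls
        return cls
    return wrap


@register("url_list")
class UrlListAdapter:
    def fetch(self, config: dict) -> Iterator[Record]:
        for i, _url in enumerate(config["urls"]):
            # A real adapter would download _url here (with rate limiting
            # and a cache check); this sketch only emits the normalized record.
            yield Record(
                source=config["name"],
                category=config["category"],
                subcategory=config["subcategory"],
                relative_path=f"{config['name']}/{i:08d}.jpg",
            )


def ingest(config: dict) -> List[Record]:
    """Look up the adapter for this source and collect its records."""
    adapter = ADAPTERS[config["kind"]]()
    return list(adapter.fetch(config))
```

Adding a new source type under this design means writing one class and decorating it; the rest of the pipeline only ever sees `Record` objects.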
PDF corpus scraping
A Scrapy spider to discover and download historical newspaper PDFs from a public library archive, followed by a concurrent PDF-to-image conversion stage at 300 DPI, multiprocessed across all CPUs.
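The conversion stage can be sketched roughly as follows, assuming `pdf2image` (and the Poppler tools it wraps) is installed. The function names, the per-PDF output folder, and the `.done` marker used to skip finished files on a rerun are assumptions made for illustration, not the original layout.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat
from pathlib import Path
from typing import Dict, List


def convert_pdf(pdf_path: str, out_dir: str, dpi: int = 300) -> int:
    """Render every page of one PDF to PNG and return the page count."""
    # Imported lazily so the module loads even where pdf2image is absent.
    from pdf2image import convert_from_path

    target = Path(out_dir) / Path(pdf_path).stem
    target.mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(pdf_path, dpi=dpi)
    for n, page in enumerate(pages, start=1):
        page.save(target / f"page_{n:04d}.png", "PNG")
    # Marker written last, so a crash mid-PDF leaves the file pending.
    (target / ".done").touch()
    return len(pages)


def pending_pdfs(pdf_dir: str, out_dir: str) -> List[str]:
    """List PDFs that have no completion marker yet, in a stable order."""
    return sorted(
        str(p)
        for p in Path(pdf_dir).glob("*.pdf")
        if not (Path(out_dir) / p.stem / ".done").exists()
    )


def convert_all(pdf_dir: str, out_dir: str, dpi: int = 300) -> Dict[str, int]:
    """Fan the pending PDFs out over one worker process per CPU."""
    todo = pending_pdfs(pdf_dir, out_dir)
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        counts = pool.map(convert_pdf, todo, repeat(out_dir), repeat(dpi))
        return dict(zip(todo, counts))
```

On platforms that spawn worker processes, `convert_all` should be called from under an `if __name__ == "__main__":` guard. Rendering a whole PDF into memory at 300 DPI can be heavy for long documents, so a production version would likely convert in page ranges.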
Deterministic sampler + concatenator
Seed-controlled, index-based sampling that distributes a 10M-sample target across datasets of varying sizes under per-subcategory quota constraints. An audit layer cross-checks the concatenated output against the source registry to confirm every declared source landed in the final corpus.
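As an illustration of seed-controlled, index-based sampling under a quota, the sketch below splits each subcategory's quota across its datasets and then draws row indices with a per-dataset seeded generator. The proportional split with largest-remainder rounding is an assumption chosen for the example (the actual allocation rule may differ), and `allocate`, `sample_indices`, and `plan` are hypothetical names.

```python
import hashlib
import random
from typing import Dict, List


def allocate(sizes: Dict[str, int], quota: int) -> Dict[str, int]:
    """Split a quota across datasets in proportion to their sizes."""
    total = sum(sizes.values())
    if total <= quota:
        return dict(sizes)  # not enough data to fill the quota: take everything
    # Integer floor of each proportional share, then hand the leftover
    # samples to the largest remainders (ties broken by dataset name).
    counts = {name: quota * size // total for name, size in sizes.items()}
    leftover = quota - sum(counts.values())
    order = sorted(sizes, key=lambda n: (-(quota * sizes[n] % total), n))
    for name in order[:leftover]:
        counts[name] += 1
    return counts


def sample_indices(name: str, size: int, count: int, seed: int) -> List[int]:
    """Pick `count` distinct row indices, reproducibly for a given seed."""
    # Derive a per-dataset seed so adding a source does not reshuffle others.
    digest = hashlib.sha256(f"{seed}:{name}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return sorted(rng.sample(range(size), count))


def plan(
    registry: Dict[str, Dict[str, int]], quotas: Dict[str, int], seed: int
) -> Dict[str, List[int]]:
    """Map each dataset name to the indices to copy into the corpus."""
    picks: Dict[str, List[int]] = {}
    for subcategory in sorted(registry):
        counts = allocate(registry[subcategory], quotas[subcategory])
        for name in sorted(counts):
            size = registry[subcategory][name]
            picks[name] = sample_indices(name, size, counts[name], seed)
    return picks
```

Because the output is a list of indices rather than copied files, the plan is cheap to recompute and can be stored alongside the corpus as a record of exactly what was selected.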
Operational hardening
YAML-driven config for local and cloud paths; JSON state files for resumable runs; single-instance locking; named per-subsystem loggers; dry-run and test-mode flags. Deployed on EC2 with S3 for intermediate state and backups.
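The resumable-run and single-instance pieces can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the state schema (`{"completed": [...]}`), the lock-file approach, and the names `load_state`, `save_state`, `single_instance`, and `run` are hypothetical, and stale-lock recovery after a hard kill is not handled.

```python
import json
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Callable, List, Tuple


def load_state(path: str) -> dict:
    """Read the JSON state file, or start fresh if none exists."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {"completed": []}


def save_state(path: str, state: dict) -> None:
    """Write state atomically so a crash never leaves a half-written file."""
    tmp = f"{path}.tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh, indent=2)
    os.replace(tmp, path)


@contextmanager
def single_instance(lock_path: str):
    """Hold an exclusive lock file for the duration of the run."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(f"another run holds {lock_path}")
    try:
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def run(
    steps: List[Tuple[str, Callable[[], None]]],
    state_path: str,
    lock_path: str,
    dry_run: bool = False,
) -> dict:
    """Execute named steps in order, skipping any already recorded as done."""
    with single_instance(lock_path):
        state = load_state(state_path)
        for name, step in steps:
            if name in state["completed"]:
                continue
            if not dry_run:
                step()
                state["completed"].append(name)
                save_state(state_path, state)  # checkpoint after every step
        return state
```

Checkpointing after each step means a restart repeats at most the step that was in flight, which is why each step should itself be safe to rerun.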
Stack
Python · HuggingFace datasets · Scrapy · pdf2image · Pandas · AWS EC2 · AWS S3 · Git.
Scale & outcomes
- Target corpus: 10,000,000 image samples across a multi-level category tree.
- Multiple heterogeneous source types reduced to a single normalized on-disk layout.
- Resumable across restarts, deterministic (reproducible with a fixed seed), auditable (source-vs-output cross-check), single-instance-safe.
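
The source-vs-output cross-check can be illustrated with a small sketch. It assumes the source registry can be reduced to planned per-source counts and that the final corpus has a manifest with one row per sample carrying a `source` field; both shapes, and the `audit` function itself, are assumptions for the example.

```python
from collections import Counter
from typing import Dict, Iterable


def audit(declared: Dict[str, int], manifest: Iterable[dict]) -> Dict[str, object]:
    """Compare planned per-source counts with what the final manifest holds."""
    landed = Counter(row["source"] for row in manifest)
    mismatched = {}
    for source, want in sorted(declared.items()):
        got = landed[source]  # Counter returns 0 for sources never seen
        if got != 0 and got != want:
            mismatched[source] = (want, got)
    return {
        # Declared in the registry but absent from the corpus.
        "missing": sorted(s for s in declared if landed[s] == 0),
        # Present in the corpus but never declared.
        "undeclared": sorted(s for s in landed if s not in declared),
        # Present, but with a different sample count than planned.
        "count_mismatch": mismatched,
    }
```

An empty report in all three fields is the pass condition; anything else names the exact source to investigate.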