
Case study

First Turkish VLM — training-data pipeline

Undisclosed (Turkish AI consultancy) · Dec 2025 – Jan 2026 · Python · HuggingFace datasets · Scrapy · pdf2image · AWS


The client needed a multi-million-image training dataset for the first Turkish Vision-Language Model (VLM), drawn from many heterogeneous sources. The data had to be normalized, categorized, sampled to target distributions, and concatenated into a single reproducible corpus. The pipeline also had to run on AWS EC2 and resume cleanly across multi-day executions.

I contracted in as the data engineer for this effort from Dec 2025 to Jan 2026: about 32 active days and about 295 commits.

The problem

A VLM training run needs millions of images with clean category, subcategory, and provenance metadata. The upstream data arrives in incompatible shapes across multiple source types: hosted datasets, image archives, loose files, URL lists, and scraped PDFs. The engineering work is in building toward a 10M-sample target under per-subcategory quotas while keeping the whole run resumable on EC2.

What I built

A config-driven, resumable Python pipeline with pluggable source adapters.

Multi-source ingestion

Pluggable downloader architecture covering hosted ML datasets, image archives, direct URL lists, and scraped document corpora. Per-source rate-limit handling, cache-first strategies, and automatic format normalization into a unified on-disk layout with provenance metadata preserved per dataset.
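The client code is not reproduced here, so what follows is only a minimal sketch of how a pluggable adapter registry of this kind could look. Every name in it (Record, ADAPTERS, register, UrlListAdapter, build_adapter, the "kind" key) is hypothetical, and the adapter only plans the on-disk layout and checks the cache rather than performing real downloads.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Iterator, List

# All names here are hypothetical; they illustrate the pattern only.


@dataclass(frozen=True)
class Record:
    """One sample in the unified layout, with its provenance attached."""
    relative_path: str   # path inside the normalized on-disk layout
    source: str          # which dataset or list the sample came from
    category: str
    subcategory: str
    cached: bool         # True if the file is already on disk


# Maps a source "kind" from the config to the class that handles it.
ADAPTERS: Dict[str, Callable[..., object]] = {}


def register(kind: str):
    """Class decorator that adds an adapter to the registry."""
    def wrap(cls):
        ADAPTERS[kind] = cls
        return cls
    return wrap


@register("url_list")
class UrlListAdapter:
    def __init__(self, name: str, urls: List[str],
                 category: str, subcategory: str):
        self.name = name
        self.urls = urls
        self.category = category
        self.subcategory = subcategory

    def plan(self, root: Path) -> Iterator[Record]:
        """Yield one record per URL, checking the cache before any download."""
        for url in self.urls:
            filename = url.rsplit("/", 1)[-1]
            relative = f"{self.name}/{filename}"
            # A real adapter would download here when the file is missing.
            yield Record(relative, self.name, self.category,
                         self.subcategory, cached=(root / relative).exists())


def build_adapter(spec: dict):
    """Instantiate the adapter named by spec['kind'] with the other keys."""
    options = dict(spec)
    kind = options.pop("kind")
    return ADAPTERS[kind](**options)
```

With this shape, adding a new source type means writing one class and registering it under a new kind; the rest of the pipeline only sees Record objects.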

PDF corpus scraping

A Scrapy spider discovers and downloads historical newspaper PDFs from a public library archive; a PDF-to-image conversion stage then renders the pages at 300 DPI, multiprocessed across all CPUs.
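As a rough illustration of the conversion stage, the sketch below fans PDFs out to one worker process per CPU and renders each with pdf2image's convert_from_path at 300 DPI. The function names, the PNG output format, and the page_NNNN naming scheme are assumptions made for the example, not details from the project. It also loads every page of a PDF into memory at once, which a production version would likely avoid.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat
from pathlib import Path
from typing import List


def page_path(pdf_path: Path, out_dir: Path, page_number: int) -> Path:
    """Deterministic output name, so a rerun can tell which pages exist."""
    return out_dir / pdf_path.stem / f"page_{page_number:04d}.png"


def convert_pdf(pdf_path: str, out_dir: str, dpi: int = 300) -> int:
    """Render every page of one PDF to PNG and return the page count."""
    # Imported lazily so the module loads without pdf2image/poppler present.
    from pdf2image import convert_from_path

    pdf, out = Path(pdf_path), Path(out_dir)
    pages = convert_from_path(str(pdf), dpi=dpi)  # list of PIL images
    for number, page in enumerate(pages, start=1):
        target = page_path(pdf, out, number)
        target.parent.mkdir(parents=True, exist_ok=True)
        page.save(target, "PNG")
    return len(pages)


def convert_all(pdf_paths: List[str], out_dir: str, dpi: int = 300) -> int:
    """Convert PDFs in parallel, one worker process per available CPU."""
    workers = os.cpu_count() or 1
    with ProcessPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(convert_pdf, pdf_paths, repeat(out_dir), repeat(dpi))
        return sum(counts)
```

On platforms that spawn worker processes, convert_all should be called from under an `if __name__ == "__main__":` guard.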

Deterministic sampler + concatenator

Seed-controlled, index-based sampling that distributes a 10M-sample target across datasets of varying sizes under per-subcategory quota constraints. An audit layer cross-checks the concatenated output against the source registry to confirm every declared source landed in the final corpus.
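To make the idea concrete, here is a minimal sketch of quota allocation, seed-controlled index sampling, and the audit cross-check. The allocation rule (smallest dataset first, equal share of what remains) is one reasonable choice I am assuming for illustration and is not necessarily the rule the pipeline used; all function names are hypothetical.

```python
import random
from typing import Dict, List


def allocate(sizes: Dict[str, int], quota: int) -> Dict[str, int]:
    """Split one subcategory quota across datasets without exceeding any size.

    Datasets are served smallest first, each taking an equal share of what
    is still unassigned, so a small dataset's shortfall rolls over to the
    larger ones.
    """
    allocation: Dict[str, int] = {}
    remaining = quota
    ordered = sorted(sizes.items(), key=lambda item: (item[1], item[0]))
    for position, (name, size) in enumerate(ordered):
        share = remaining // (len(ordered) - position)
        take = min(size, share)
        allocation[name] = take
        remaining -= take
    return allocation


def sample_indices(name: str, size: int, count: int, seed: int) -> List[int]:
    """Pick `count` distinct row indices from a dataset of `size` rows.

    The generator is seeded from the run seed plus the dataset name, so each
    dataset's draw is reproducible and independent of processing order.
    """
    rng = random.Random(f"{seed}:{name}")
    return sorted(rng.sample(range(size), count))


def audit(declared: List[str], landed: List[str]) -> List[str]:
    """Return declared sources that never appear in the final corpus."""
    return sorted(set(declared) - set(landed))
```

For example, a quota of 300 over datasets of 10, 100, and 1000 rows comes out as 10, 100, and 190 under this rule. Sampling indices rather than rows keeps the selection cheap to store and replay on a resumed run.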

Operational hardening

YAML-driven config for local and cloud paths; JSON state files for resumable runs; single-instance locking; named per-subsystem loggers; dry-run and test-mode flags. Deployed on EC2 with S3 for intermediate state and backups.
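Two of these pieces, the JSON state file and the single-instance lock, can be sketched with the standard library alone. The class names, the state-file schema (a single "done" list), and the lock-file approach are assumptions for illustration; the sketch also does not handle a stale lock left behind by a crashed run.

```python
import json
import os
from pathlib import Path
from typing import Set


class RunState:
    """Tracks finished steps in a JSON file so a restart can skip them."""

    def __init__(self, path: Path):
        self.path = path
        self.done: Set[str] = set()
        if path.exists():
            self.done = set(json.loads(path.read_text())["done"])

    def mark_done(self, step: str) -> None:
        self.done.add(step)
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a truncated state file behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"done": sorted(self.done)}))
        os.replace(tmp, self.path)


class SingleInstanceLock:
    """Context manager that refuses to start if the lock file exists."""

    def __init__(self, path: Path):
        self.path = path

    def __enter__(self):
        try:
            # O_EXCL makes creation fail if the file is already there.
            fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            raise RuntimeError(f"another run holds {self.path}") from None
        os.write(fd, str(os.getpid()).encode())
        os.close(fd)
        return self

    def __exit__(self, *exc_info):
        os.unlink(self.path)
```

A run would wrap its main loop in `with SingleInstanceLock(...)`, skip any step already in `RunState.done`, and call `mark_done` after each step completes.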

Stack

Python · HuggingFace datasets · Scrapy · pdf2image · Pandas · AWS EC2 · AWS S3 · Git.

Scale & outcomes

  • Target corpus: 10,000,000 image samples across a multi-level category tree.
  • Multiple heterogeneous source types reduced to a single normalized on-disk layout.
  • Resumable across restarts, deterministic (reproducible with a fixed seed), auditable (source-vs-output cross-check), single-instance-safe.
