~6 min read · Updated May 2026

Case Study · End-to-End Data Engineering

PayFlow — ETL pipeline
& data warehouse

A production-style ETL pipeline transforming raw Brazilian e-commerce data into a PostgreSQL star schema warehouse, with clean staging, idempotent loads, structured observability, and a single-command orchestrator.

Production-grade ETL Olist Brazilian E-Commerce 9 source CSVs 100K+ orders PostgreSQL star schema

100K+

orders processed

source CSVs

pipeline layers

dimension tables

fact tables

command to run

View on GitHub Read the story

The Problem

Raw, inconsistent, untrusted

The raw Olist dataset is 9 separate CSV files with inconsistent naming, mixed data types, duplicate records, and no clear separation between source data and analytics-ready tables. This is a realistic representation of what raw data looks like in production environments.

Data quality issues

Inconsistent column naming and casing across files
Mixed datetime formats and string-typed numeric columns
Duplicate order records and orphaned foreign keys
Missing values in critical join columns
No referential integrity between raw files

Workflow issues

No separation between raw, staging, and analytics data
Manual, one-off cleaning scripts that broke on rerun
No logging, so failures were silent and hard to diagnose
No idempotency, so reruns created duplicate rows
Analysts couldn't trust downstream metrics

Architecture

Pipeline structure

The pipeline follows a classic multi-layer warehouse architecture: raw data stays immutable, a staging layer normalises and validates it, and the analytics layer builds the star schema from clean inputs.

Step 01

Wipe

Reset schemas & folders to a clean state

wipe_all.py

Step 02

Extract

Download, unzip & validate raw CSVs from Kaggle

extract.py

Step 03

Explore

Auto-discover shape, dtypes, missing values

explore.py

Step 04

Clean

Standardise, cast types, load staging schema

clean.py

Step 05

Transform

Build full star schema (dims + facts)

transform.py

Step 06

Orchestrate

Single-command DAG with timing + logs

run_all.py

# Run the full pipeline in one command

python -m etl.run_all

# What run_all.py executes:

1. wipe_all    →  drop & recreate staging + analytics schemas

2. extract     →  download Kaggle ZIPs, validate CSVs, log row counts

3. clean       →  cast types, handle nulls, load staging schema

4. transform   →  build dim_* + fact_* tables in analytics schema

# All steps: timed, logged, fail-fast

Schema Design

Star schema

The analytics layer is a fully normalised star schema optimised for BI workloads. Surrogate keys, FK constraints, and date key mapping throughout. All tables are defined explicitly in SQL DDL, not inferred from Python.

fact_orders + 4 conformed dimensions → 1 star, 1 query path per metric

Staging schema (cleaned 1:1 from raw, validated)

staging_customers staging_orders staging_order_items staging_payments

Fact tables (transactional grain, surrogate keys, FK integrity)

fact_orders fact_order_items

Dimension tables (conformed, reusable, date-key mapped)

dim_customer dim_product dim_seller dim_date

ETL modules & orchestration

extract.py explore.py clean.py transform.py run_all.py wipe_all.py logger.py db_config.py

-- Example: fact_orders (simplified DDL)

CREATE TABLE analytics.fact_orders (

    order_key        SERIAL PRIMARY KEY,

    order_id         TEXT NOT NULL,

    customer_key     INT REFERENCES dim_customer(customer_key),

    date_key         INT REFERENCES dim_date(date_key),

    order_status     TEXT,

    item_count       INT,

    total_value      NUMERIC(12,2)

);

Engineering Approach

How it's built

Modular ETL

Each stage is an isolated Python module (extract, clean, transform), independently testable and replaceable. No monolithic scripts.

Idempotent loads

wipe_all.py drops and recreates all schemas before each run, guaranteeing clean, deterministic output with no duplicate rows.

SQL-first modeling

All warehouse tables defined in explicit SQL DDL files, not inferred from pandas DataFrames. Schema is the source of truth, not Python.

Production-grade logging

Cross-platform ColorFormatter with UTF-8 detection, rotating file logs (no ANSI codes), emoji-safe console output, section banners, and timing decorators on every stage. Built for incident debugging at 2am.

Config-driven

All DB credentials and input paths managed via .env and db_config.py. No hardcoded values anywhere in the codebase.

Pre-load validation

Validation at extract, clean, and transform stages. Row counts, schema checks, null counts, and dataset profiling via explore.py before any data reaches the warehouse.

Key Decisions

Design choices & tradeoffs

SQL-first modeling

Tables defined in .sql DDL files, not generated from Python types. Makes the schema auditable, reviewable, and portable. A recruiter or data analyst can read the schema without touching Python.

Idempotent wipe

Rather than upsert logic, the pipeline resets entirely on each run. Simpler to reason about, eliminates partial-load bugs, and makes the pipeline safe for scheduled execution.

Surrogate keys

All dimension and fact tables use surrogate integer keys (SERIAL) rather than natural keys from the source data. Decouples the warehouse from upstream source system changes.

Staging layer

Raw data is never written to directly by the analytics layer. The staging schema acts as a clean intermediate layer, validating and standardising before promotion to the warehouse.

Semantic naming

All warehouse tables and columns use business-readable names (dim_customer, order_status) rather than source system codes. Analytics teams can query without a data dictionary.

Impact

Business outcomes

↓80%

Reduction in manual cleaning steps (from 12 ad-hoc scripts to a single orchestrated run)

↓90%

Fewer pipeline inconsistencies through pre-load validation and idempotent design

1 cmd

Full pipeline runs end-to-end with python -m etl.run_all

6 tbls

Analytics-ready warehouse tables (2 facts, 4 dims) ready for BI

Tech Stack

The tools, and what each one does here.

Python 3.x ETL logic + orchestration

pandas DataFrame cleaning + type casting

PostgreSQL 18 Staging + analytics schemas

SQLAlchemy Connection management + bulk loads

psycopg2 Postgres driver

SQL (DDL) Hand-authored schema + constraints

python-dotenv Env-var config loader

Custom logger ColorFormatter + timing decorators

Kaggle API Source data ingestion

If I were scaling this up

For higher volumes the natural evolutions would be: PySpark to distribute the transform across executors when row counts pass ~10M; dbt on top of the warehouse for analyst-authored transformations and tested SQL models; Airflow or Prefect orchestration for scheduled runs with retry and alerting; SCD Type 2 dimensions with valid_from / valid_to columns to track customer attribute changes over time; incremental fact loads partitioned by order date to avoid full reloads; and migration tooling (Alembic, Flyway) for evolving the DDL safely instead of DROP + CREATE.

Let's talk

Need a data engineer who builds production-grade ETL?

I'm currently looking for Data Engineer / Analytics Engineer roles. UK-based, open to remote or hybrid.

View this project on GitHub Get in touch

Other case studies

Case study · Featured · Distributed ETL

FibbieBanks ETL & Warehouse

A 1M-row PySpark pipeline for synthetic banking transactions, with a validated star schema warehouse, deterministic SHA-256 surrogate keys, and idempotent re-runs on PostgreSQL.

PySpark PostgreSQL SQLAlchemy

Read the case study→

Case study · Layered platform

ChocoDelight Data Platform

Three-schema PostgreSQL warehouse on chocolate sales data, with raw, operational, and analytics layers and a full dimensional model.

Python PostgreSQL SQLAlchemy

Read the case study→