Back to portfolio
~6 min read · Updated May 2026
Case Study · End-to-End Data Engineering

PayFlow — ETL pipeline
& data warehouse

A production-style ETL pipeline transforming raw Brazilian e-commerce data into a PostgreSQL star schema warehouse, with clean staging, idempotent loads, structured observability, and a single-command orchestrator.

Production-grade ETL Olist Brazilian E-Commerce 9 source CSVs 100K+ orders PostgreSQL star schema
100K+
orders processed
9
source CSVs
4
pipeline layers
4
dimension tables
2
fact tables
1
command to run
01
The Problem

Raw, inconsistent, untrusted

The raw Olist dataset is 9 separate CSV files with inconsistent naming, mixed data types, duplicate records, and no clear separation between source data and analytics-ready tables. This is a realistic representation of what raw data looks like in production environments.

Data quality issues

  • Inconsistent column naming and casing across files
  • Mixed datetime formats and string-typed numeric columns
  • Duplicate order records and orphaned foreign keys
  • Missing values in critical join columns
  • No referential integrity between raw files

Workflow issues

  • No separation between raw, staging, and analytics data
  • Manual, one-off cleaning scripts that broke on rerun
  • No logging, so failures were silent and hard to diagnose
  • No idempotency, so reruns created duplicate rows
  • Analysts couldn't trust downstream metrics
02
Architecture

Pipeline structure

The pipeline follows a classic multi-layer warehouse architecture: raw data stays immutable, a staging layer normalises and validates it, and the analytics layer builds the star schema from clean inputs.

Step 01
Wipe
Reset schemas & folders to a clean state
wipe_all.py
Step 02
Extract
Download, unzip & validate raw CSVs from Kaggle
extract.py
Step 03
Explore
Auto-discover shape, dtypes, missing values
explore.py
Step 04
Clean
Standardise, cast types, load staging schema
clean.py
Step 05
Transform
Build full star schema (dims + facts)
transform.py
Step 06
Orchestrate
Single-command DAG with timing + logs
run_all.py
# Run the full pipeline in one command
python -m etl.run_all

# What run_all.py executes:
1. wipe_all → drop & recreate staging + analytics schemas
2. extract → download Kaggle ZIPs, validate CSVs, log row counts
3. clean → cast types, handle nulls, load staging schema
4. transform → build dim_* + fact_* tables in analytics schema

# All steps: timed, logged, fail-fast
03
Schema Design

Star schema

The analytics layer is a fully normalised star schema optimised for BI workloads. Surrogate keys, FK constraints, and date key mapping throughout. All tables are defined explicitly in SQL DDL, not inferred from Python.

fact_orders + 4 conformed dimensions 1 star, 1 query path per metric
Staging schema (cleaned 1:1 from raw, validated)
staging_customers staging_orders staging_order_items staging_payments
Fact tables (transactional grain, surrogate keys, FK integrity)
fact_orders fact_order_items
Dimension tables (conformed, reusable, date-key mapped)
dim_customer dim_product dim_seller dim_date
ETL modules & orchestration
extract.py explore.py clean.py transform.py run_all.py wipe_all.py logger.py db_config.py
-- Example: fact_orders (simplified DDL)
CREATE TABLE analytics.fact_orders (
  order_key SERIAL PRIMARY KEY,
  order_id TEXT NOT NULL,
  customer_key INT REFERENCES dim_customer(customer_key),
  date_key INT REFERENCES dim_date(date_key),
  order_status TEXT,
  item_count INT,
  total_value NUMERIC(12,2)
);
04
Engineering Approach

How it's built

Modular ETL

Each stage is an isolated Python module (extract, clean, transform), independently testable and replaceable. No monolithic scripts.

Idempotent loads

wipe_all.py drops and recreates all schemas before each run, guaranteeing clean, deterministic output with no duplicate rows.

SQL-first modeling

All warehouse tables defined in explicit SQL DDL files, not inferred from pandas DataFrames. Schema is the source of truth, not Python.

Production-grade logging

Cross-platform ColorFormatter with UTF-8 detection, rotating file logs (no ANSI codes), emoji-safe console output, section banners, and timing decorators on every stage. Built for incident debugging at 2am.

Config-driven

All DB credentials and input paths managed via .env and db_config.py. No hardcoded values anywhere in the codebase.

Pre-load validation

Validation at extract, clean, and transform stages. Row counts, schema checks, null counts, and dataset profiling via explore.py before any data reaches the warehouse.

05
Key Decisions

Design choices & tradeoffs

SQL-first modeling
Tables defined in .sql DDL files, not generated from Python types. Makes the schema auditable, reviewable, and portable. A recruiter or data analyst can read the schema without touching Python.
Idempotent wipe
Rather than upsert logic, the pipeline resets entirely on each run. Simpler to reason about, eliminates partial-load bugs, and makes the pipeline safe for scheduled execution.
Surrogate keys
All dimension and fact tables use surrogate integer keys (SERIAL) rather than natural keys from the source data. Decouples the warehouse from upstream source system changes.
Staging layer
Raw data is never written to directly by the analytics layer. The staging schema acts as a clean intermediate layer, validating and standardising before promotion to the warehouse.
Semantic naming
All warehouse tables and columns use business-readable names (dim_customer, order_status) rather than source system codes. Analytics teams can query without a data dictionary.
06
Impact

Business outcomes

↓80%
Reduction in manual cleaning steps (from 12 ad-hoc scripts to a single orchestrated run)
↓90%
Fewer pipeline inconsistencies through pre-load validation and idempotent design
1 cmd
Full pipeline runs end-to-end with python -m etl.run_all
6 tbls
Analytics-ready warehouse tables (2 facts, 4 dims) ready for BI
07
Tech Stack

The tools, and what each one does here.

Python 3.x ETL logic + orchestration
pandas DataFrame cleaning + type casting
PostgreSQL 18 Staging + analytics schemas
SQLAlchemy Connection management + bulk loads
psycopg2 Postgres driver
SQL (DDL) Hand-authored schema + constraints
python-dotenv Env-var config loader
Custom logger ColorFormatter + timing decorators
Kaggle API Source data ingestion

If I were scaling this up

For higher volumes the natural evolutions would be: PySpark to distribute the transform across executors when row counts pass ~10M; dbt on top of the warehouse for analyst-authored transformations and tested SQL models; Airflow or Prefect orchestration for scheduled runs with retry and alerting; SCD Type 2 dimensions with valid_from / valid_to columns to track customer attribute changes over time; incremental fact loads partitioned by order date to avoid full reloads; and migration tooling (Alembic, Flyway) for evolving the DDL safely instead of DROP + CREATE.

Let's talk

Need a data engineer who builds production-grade ETL?

I'm currently looking for Data Engineer / Analytics Engineer roles. UK-based, open to remote or hybrid.