~7 min read · Updated May 2026

Case Study · Layered Data Platform

ChocoDelight Data Platform
& analytics warehouse

A production-grade ETL and Data Warehouse pipeline I designed and built to turn a flat, messy chocolate sales dataset into a structured PostgreSQL warehouse. Three schemas (raw, operational, analytics), modular ETL with feature engineering, surrogate keys, foreign-key validation, and a single-command orchestrator. Built to be reliable, reproducible, maintainable, analytics-ready, and observable. Not just to transform CSVs.

Layered architecture 1M sales rows Kaggle source dataset 3-schema design Star schema + FK constraints PostgreSQL 14+

sales rows processed

schema layers

dimension tables

fact table

FK validations

~7m

end-to-end runtime

View on GitHub Read the story

The Problem

Flat, messy, undocumented

The source data is a single Kaggle dataset with five flat CSV files (calendar, customers, products, sales, stores). It is the kind of dataset that looks tidy on the surface but breaks the moment you try to model it for analytics. Inconsistent IDs, missing reference rows, no surrogate keys, and zero separation between source data and analytical structure.

Source data issues

Mixed casing and inconsistent text values across categorical columns
Missing product records that appeared in sales but not in products
No surrogate keys, only natural string IDs prone to drift
Numeric metrics stored as strings with currency symbols
No referential integrity enforced anywhere

Modelling gaps

No separation between raw, operational, and analytics layers
No engineered features, just raw values exposed to analysts
No date dimension or calendar attributes for time-based analysis
No customer segmentation or product tier classification
No outlier detection or margin bucketing for BI workloads

Architecture

Three-layer pipeline

The platform is built around a layered architecture that separates concerns at every stage. Raw data is preserved untouched, the operational layer stores cleaned and standardised tables, and the analytics layer is the dimensional warehouse used for reporting. Each layer has its own PostgreSQL schema and its own loader module.

Step 01

Wipe

Reset folders and reset all 3 schemas

wipe_all.py

Step 02

Extract

Download dataset via Kaggle CLI, unzip, validate

extract.py

Step 03

Clean

Standardise, type cast, engineer features, write cleaned CSVs

clean.py

Step 04

Transform

Build dim & fact tables, validate FK integrity

transform.py

Step 05

Load Layers

Load raw → operational → analytics in order

load_*_schema.py

Step 06

Orchestrate

Single-command pipeline runner with structured logs

run_all.py

# Run the entire pipeline end-to-end

python -m etl.run_all

# What run_all.py executes, in order:

1. wipe_all                  →  drop folders + reset all 3 schemas

2. extract                   →  Kaggle CLI download, unzip, log row counts

3. clean                     →  standardise, type cast, engineer features

4. transform                 →  build dim & fact tables, validate FKs

5. load_raw_schema           →  raw_data schema (source fidelity)

6. load_operationals_schema  →  operationals schema (cleaned)

7. load_analytics_schema     →  analytics schema (dim + fact)

# Every step: timed, logged via custom ColorFormatter, fail-fast

Schema Design

Three schemas, one warehouse

Each schema has a clear, single responsibility. Raw is for source fidelity and audit. Operational is for cleaned, standardised data with engineered features but no modelling. Analytics is the dimensional warehouse, indexed and ready for BI workloads.

raw_data schema (source fidelity, untouched)

raw_data.calendar raw_data.customers raw_data.products raw_data.sales raw_data.stores

operationals schema (cleaned, standardised, feature-engineered)

operationals.cleaned_calendar operationals.cleaned_customers operationals.cleaned_products operationals.cleaned_sales operationals.cleaned_stores operationals.cleaned_regions

analytics schema (dimensional model, BI-ready)

analytics.dim_calendar analytics.dim_customers analytics.dim_products analytics.dim_stores analytics.dim_regions analytics.fact_sales

-- Star schema lineage (mostly star, one intentional snowflake)

dim_calendar   ─┐

dim_customers  ─┤

dim_products   ─┼──>  fact_sales

dim_stores     ─┤           │

dim_regions    ─┘           │

                          ↑ direct region_key FK

                            for "revenue by region" queries

-- Five FK validations run in transform.py before fact load

-- DB enforces FK RESTRICT + 7 CHECK constraints + 13 indexes

Feature Engineering

From raw values to analytics-ready features

The cleaning stage is not just type casting and null handling. Each table gets domain-specific engineered features that make downstream BI work meaningful, not mechanical.

Customers

Tenure metrics: days, months, years from join date
Customer segment derived from tenure thresholds
Standardised gender and demographic fields
Surrogate customer_key for the dim table

Calendar

Renamed and re-parsed date columns for safe joins
Quarter, season, day type derived from date
Formatted date strings for display layers
Surrogate calendar_key for FK joins

Products

Standardised text fields and categorical values
Brand tier classification logic
Backfilled missing product records (P0000, P0201)
Surrogate product_key for the dim table

Stores

Country-to-region mapping (6 continental regions)
Region tier classification by market
Region promoted to its own dim during transform
Surrogate store_key + region_key FK

Sales

Coerced numeric metrics (revenue, cost, margin)
Revenue buckets and margin buckets
Outlier flag for unusually large transactions
Time of day categorisation from order timestamp

Cross-cutting

Foreign key validation before fact construction
Indexes on dim natural keys and fact FKs
Indexes on analytical filters (revenue_bucket, time_of_day)
FK constraints enforced at the DDL level

Key Decisions

Design choices & tradeoffs

Three explicit schemas

Raw, operational, and analytics each have their own schema rather than collapsing into one. This makes lineage obvious, makes auditing trivial, and lets each layer be wiped or rebuilt independently. Slightly more storage cost, much clearer mental model.

Wipe-then-load

Each pipeline run drops and recreates all schemas before loading. No upsert logic, no partial-load bugs, fully deterministic output. Trade-off: not suitable for very large datasets, but ideal for project-scale reproducibility and CI testing.

Positional surrogate keys

Every dim gets its surrogate key from df.sort_values(natural_id).reset_index(drop=True).index + 1. Deterministic for the same input, simple, fast. Trade-off: keys shift if new rows are added, so this only works because the warehouse is rebuilt from scratch every run (wipe-and-load). For a warehouse that merged data over time, content-based hashing (e.g. SHA-256 of natural keys) would be the right call instead.

Feature engineering in cleaning

Engineered features (tenure, segments, buckets, time of day) are computed in the cleaning stage, not in BI tools. Analysts get clean, semantically meaningful columns from day one, and the logic lives in version-controlled Python rather than SQL workbooks.

Modular loaders

Each schema has its own loader module (load_raw_schema, load_operationals_schema, load_analytics_schema). Each one can be run independently for debugging, and the orchestrator just composes them. No god-script.

FK validation before fact load

Before loading fact_sales, the transform stage runs five separate validate_foreign_keys() checks: sales→customers, sales→products, sales→stores, sales→regions, sales→calendar. If any fails, the pipeline halts with sys.exit(1) and a structured error. Catches referential integrity issues before they reach the warehouse.

DB constraints, not just types

The analytics schema (sql/analytics_schema.sql) defines five FOREIGN KEY ... ON DELETE RESTRICT constraints, seven CHECK constraints (age ≥ 0, quantity ≥ 0, order_date ≤ CURRENT_DATE, prices ≥ 0), UNIQUE indexes on every natural ID, and 13 indexes across the fact and dims. PostgreSQL enforces the integrity, not Python. Try to delete a dimension row that's still referenced by a fact and Postgres refuses. The warehouse defends itself.

Region denormalised for query speed

The schema is mostly a pure star, with one intentional snowflake: fact_sales has a direct region_key FK to dim_regions, AND dim_stores also has a region_key FK to the same dim. The most common analytical query against this warehouse is "revenue by region", and the direct fact-to-region path saves an extra join. The full normalised path via dim_stores is still there for queries that need other store attributes. Documented snowflake, not accidental snowflake.

Production-grade logging

The logging system is a deliberate engineering layer, not an afterthought. Cross-platform UTF-8 detection (Windows Terminal, VS Code, PowerShell 7+, macOS/Linux) decides whether the console can render emojis; dual handlers route colorized output to the terminal and clean UTF-8 to a rotating file (no ANSI codes in logs); a section() helper visually separates ETL stages; a timed() decorator logs how long each stage takes; SQLAlchemy engine noise is suppressed. Result: observability that holds up at 2am, not just on the demo screen.

Outcomes

What the platform enables

3 schemas

Clean separation between raw, operational, and analytics layers

10+ feats

Engineered features available to BI on day one (segments, buckets, tiers)

1 cmd

Full pipeline runs end-to-end with python -m etl.run_all

5 tbls

Analytics-ready dim + fact tables, FK-validated and indexed

What this warehouse can answer

The dimensional model directly supports the kind of questions a sales or growth team would ask of a chocolate-retail dataset. Each one is a single SQL query against the analytics schema, not a multi-step pivot in a spreadsheet.

Revenue analysis

Revenue by product, store, or region
Margin analysis with bucket segmentation
Outlier flagging for unusually large transactions

Customer behaviour

Customer segmentation by tenure
Cohort comparison across joining periods
Repeat-buyer identification via fact_sales

Time & seasonality

Quarter and season-based performance
Day-type and time-of-day sales patterns
Year-over-year trend analysis via dim_calendar

Tech Stack

The tools, and what each one does here.

Python 3.x ETL + transforms orchestration

pandas Cleaning + feature engineering

PostgreSQL 18 Three-schema warehouse

SQLAlchemy DB engine + FK validation

psycopg2 Postgres driver

SQL (DDL) Layered schema design

python-dotenv Env-var config loader

Custom logger ColorFormatter + structured logs

Kaggle CLI Source data download

If I were scaling this up

For production scale the natural evolutions would be: dbt on top of the analytics schema for analyst-authored transformations with tested SQL models; Airflow or Prefect orchestration for scheduled refreshes with retry and alerting; materialised views for the most-queried analytics aggregates; SCD Type 2 dimensions with valid_from / valid_to tracking for time-aware product and category history; incremental loads against the operational layer instead of full refreshes; and migration tooling (Alembic, Flyway) for evolving the three-schema DDL safely.

Let's talk

Need a data engineer who can layer a warehouse properly?

I'm currently looking for Data Engineer / Analytics Engineer roles. UK-based, open to remote or hybrid.

View this project on GitHub Get in touch

Other case studies

Case study · Featured · Distributed ETL

FibbieBanks ETL & Warehouse

A 1M-row PySpark pipeline for synthetic banking transactions, with a validated star schema warehouse, deterministic SHA-256 surrogate keys, and idempotent re-runs on PostgreSQL.

PySpark PostgreSQL SQLAlchemy

Read the case study→

Case study · End-to-end ETL

PayFlow — ETL & Warehouse

A production-style ETL pipeline that loads 9 raw Brazilian e-commerce CSVs into a full PostgreSQL star schema, with idempotent loads and structured logging.

Python PostgreSQL SQLAlchemy