Back to portfolio
~7 min read · Updated May 2026
Case Study · Layered Data Platform

ChocoDelight Data Platform
& analytics warehouse

A production-grade ETL and Data Warehouse pipeline I designed and built to turn a flat, messy chocolate sales dataset into a structured PostgreSQL warehouse. Three schemas (raw, operational, analytics), modular ETL with feature engineering, surrogate keys, foreign-key validation, and a single-command orchestrator. Built to be reliable, reproducible, maintainable, analytics-ready, and observable. Not just to transform CSVs.

Layered architecture 1M sales rows Kaggle source dataset 3-schema design Star schema + FK constraints PostgreSQL 14+
1M
sales rows processed
3
schema layers
5
dimension tables
1
fact table
5
FK validations
~7m
end-to-end runtime
01
The Problem

Flat, messy, undocumented

The source data is a single Kaggle dataset with five flat CSV files (calendar, customers, products, sales, stores). It is the kind of dataset that looks tidy on the surface but breaks the moment you try to model it for analytics. Inconsistent IDs, missing reference rows, no surrogate keys, and zero separation between source data and analytical structure.

Source data issues

  • Mixed casing and inconsistent text values across categorical columns
  • Missing product records that appeared in sales but not in products
  • No surrogate keys, only natural string IDs prone to drift
  • Numeric metrics stored as strings with currency symbols
  • No referential integrity enforced anywhere

Modelling gaps

  • No separation between raw, operational, and analytics layers
  • No engineered features, just raw values exposed to analysts
  • No date dimension or calendar attributes for time-based analysis
  • No customer segmentation or product tier classification
  • No outlier detection or margin bucketing for BI workloads
02
Architecture

Three-layer pipeline

The platform is built around a layered architecture that separates concerns at every stage. Raw data is preserved untouched, the operational layer stores cleaned and standardised tables, and the analytics layer is the dimensional warehouse used for reporting. Each layer has its own PostgreSQL schema and its own loader module.

Step 01
Wipe
Reset folders and reset all 3 schemas
wipe_all.py
Step 02
Extract
Download dataset via Kaggle CLI, unzip, validate
extract.py
Step 03
Clean
Standardise, type cast, engineer features, write cleaned CSVs
clean.py
Step 04
Transform
Build dim & fact tables, validate FK integrity
transform.py
Step 05
Load Layers
Load raw → operational → analytics in order
load_*_schema.py
Step 06
Orchestrate
Single-command pipeline runner with structured logs
run_all.py
# Run the entire pipeline end-to-end
python -m etl.run_all

# What run_all.py executes, in order:
1. wipe_all → drop folders + reset all 3 schemas
2. extract → Kaggle CLI download, unzip, log row counts
3. clean → standardise, type cast, engineer features
4. transform → build dim & fact tables, validate FKs
5. load_raw_schema → raw_data schema (source fidelity)
6. load_operationals_schema → operationals schema (cleaned)
7. load_analytics_schema → analytics schema (dim + fact)

# Every step: timed, logged via custom ColorFormatter, fail-fast
03
Schema Design

Three schemas, one warehouse

Each schema has a clear, single responsibility. Raw is for source fidelity and audit. Operational is for cleaned, standardised data with engineered features but no modelling. Analytics is the dimensional warehouse, indexed and ready for BI workloads.

raw_data schema (source fidelity, untouched)
raw_data.calendar raw_data.customers raw_data.products raw_data.sales raw_data.stores
operationals schema (cleaned, standardised, feature-engineered)
operationals.cleaned_calendar operationals.cleaned_customers operationals.cleaned_products operationals.cleaned_sales operationals.cleaned_stores operationals.cleaned_regions
analytics schema (dimensional model, BI-ready)
analytics.dim_calendar analytics.dim_customers analytics.dim_products analytics.dim_stores analytics.dim_regions analytics.fact_sales
-- Star schema lineage (mostly star, one intentional snowflake)

dim_calendar ─┐
dim_customers ─┤
dim_products ─┼──> fact_sales
dim_stores ─┤
dim_regions ─┘
↑ direct region_key FK
for "revenue by region" queries

-- Five FK validations run in transform.py before fact load
-- DB enforces FK RESTRICT + 7 CHECK constraints + 13 indexes
04
Feature Engineering

From raw values to analytics-ready features

The cleaning stage is not just type casting and null handling. Each table gets domain-specific engineered features that make downstream BI work meaningful, not mechanical.

Customers

  • Tenure metrics: days, months, years from join date
  • Customer segment derived from tenure thresholds
  • Standardised gender and demographic fields
  • Surrogate customer_key for the dim table

Calendar

  • Renamed and re-parsed date columns for safe joins
  • Quarter, season, day type derived from date
  • Formatted date strings for display layers
  • Surrogate calendar_key for FK joins

Products

  • Standardised text fields and categorical values
  • Brand tier classification logic
  • Backfilled missing product records (P0000, P0201)
  • Surrogate product_key for the dim table

Stores

  • Country-to-region mapping (6 continental regions)
  • Region tier classification by market
  • Region promoted to its own dim during transform
  • Surrogate store_key + region_key FK

Sales

  • Coerced numeric metrics (revenue, cost, margin)
  • Revenue buckets and margin buckets
  • Outlier flag for unusually large transactions
  • Time of day categorisation from order timestamp

Cross-cutting

  • Foreign key validation before fact construction
  • Indexes on dim natural keys and fact FKs
  • Indexes on analytical filters (revenue_bucket, time_of_day)
  • FK constraints enforced at the DDL level
05
Key Decisions

Design choices & tradeoffs

Three explicit schemas
Raw, operational, and analytics each have their own schema rather than collapsing into one. This makes lineage obvious, makes auditing trivial, and lets each layer be wiped or rebuilt independently. Slightly more storage cost, much clearer mental model.
Wipe-then-load
Each pipeline run drops and recreates all schemas before loading. No upsert logic, no partial-load bugs, fully deterministic output. Trade-off: not suitable for very large datasets, but ideal for project-scale reproducibility and CI testing.
Positional surrogate keys
Every dim gets its surrogate key from df.sort_values(natural_id).reset_index(drop=True).index + 1. Deterministic for the same input, simple, fast. Trade-off: keys shift if new rows are added, so this only works because the warehouse is rebuilt from scratch every run (wipe-and-load). For a warehouse that merged data over time, content-based hashing (e.g. SHA-256 of natural keys) would be the right call instead.
Feature engineering in cleaning
Engineered features (tenure, segments, buckets, time of day) are computed in the cleaning stage, not in BI tools. Analysts get clean, semantically meaningful columns from day one, and the logic lives in version-controlled Python rather than SQL workbooks.
Modular loaders
Each schema has its own loader module (load_raw_schema, load_operationals_schema, load_analytics_schema). Each one can be run independently for debugging, and the orchestrator just composes them. No god-script.
FK validation before fact load
Before loading fact_sales, the transform stage runs five separate validate_foreign_keys() checks: sales→customers, sales→products, sales→stores, sales→regions, sales→calendar. If any fails, the pipeline halts with sys.exit(1) and a structured error. Catches referential integrity issues before they reach the warehouse.
DB constraints, not just types
The analytics schema (sql/analytics_schema.sql) defines five FOREIGN KEY ... ON DELETE RESTRICT constraints, seven CHECK constraints (age ≥ 0, quantity ≥ 0, order_date ≤ CURRENT_DATE, prices ≥ 0), UNIQUE indexes on every natural ID, and 13 indexes across the fact and dims. PostgreSQL enforces the integrity, not Python. Try to delete a dimension row that's still referenced by a fact and Postgres refuses. The warehouse defends itself.
Region denormalised for query speed
The schema is mostly a pure star, with one intentional snowflake: fact_sales has a direct region_key FK to dim_regions, AND dim_stores also has a region_key FK to the same dim. The most common analytical query against this warehouse is "revenue by region", and the direct fact-to-region path saves an extra join. The full normalised path via dim_stores is still there for queries that need other store attributes. Documented snowflake, not accidental snowflake.
Production-grade logging
The logging system is a deliberate engineering layer, not an afterthought. Cross-platform UTF-8 detection (Windows Terminal, VS Code, PowerShell 7+, macOS/Linux) decides whether the console can render emojis; dual handlers route colorized output to the terminal and clean UTF-8 to a rotating file (no ANSI codes in logs); a section() helper visually separates ETL stages; a timed() decorator logs how long each stage takes; SQLAlchemy engine noise is suppressed. Result: observability that holds up at 2am, not just on the demo screen.
06
Outcomes

What the platform enables

3 schemas
Clean separation between raw, operational, and analytics layers
10+ feats
Engineered features available to BI on day one (segments, buckets, tiers)
1 cmd
Full pipeline runs end-to-end with python -m etl.run_all
5 tbls
Analytics-ready dim + fact tables, FK-validated and indexed

What this warehouse can answer

The dimensional model directly supports the kind of questions a sales or growth team would ask of a chocolate-retail dataset. Each one is a single SQL query against the analytics schema, not a multi-step pivot in a spreadsheet.

Revenue analysis

  • Revenue by product, store, or region
  • Margin analysis with bucket segmentation
  • Outlier flagging for unusually large transactions

Customer behaviour

  • Customer segmentation by tenure
  • Cohort comparison across joining periods
  • Repeat-buyer identification via fact_sales

Time & seasonality

  • Quarter and season-based performance
  • Day-type and time-of-day sales patterns
  • Year-over-year trend analysis via dim_calendar
07
Tech Stack

The tools, and what each one does here.

Python 3.x ETL + transforms orchestration
pandas Cleaning + feature engineering
PostgreSQL 18 Three-schema warehouse
SQLAlchemy DB engine + FK validation
psycopg2 Postgres driver
SQL (DDL) Layered schema design
python-dotenv Env-var config loader
Custom logger ColorFormatter + structured logs
Kaggle CLI Source data download

If I were scaling this up

For production scale the natural evolutions would be: dbt on top of the analytics schema for analyst-authored transformations with tested SQL models; Airflow or Prefect orchestration for scheduled refreshes with retry and alerting; materialised views for the most-queried analytics aggregates; SCD Type 2 dimensions with valid_from / valid_to tracking for time-aware product and category history; incremental loads against the operational layer instead of full refreshes; and migration tooling (Alembic, Flyway) for evolving the three-schema DDL safely.

Let's talk

Need a data engineer who can layer a warehouse properly?

I'm currently looking for Data Engineer / Analytics Engineer roles. UK-based, open to remote or hybrid.