A production-grade ETL and Data Warehouse pipeline I designed and built to turn a flat, messy
chocolate sales dataset into a structured PostgreSQL warehouse. Three schemas (raw, operational, analytics),
modular ETL with feature engineering, surrogate keys, foreign-key validation, and a single-command orchestrator.
Built to be reliable, reproducible, maintainable, analytics-ready, and observable. Not just to
transform CSVs.
The source data is a single Kaggle dataset with five flat CSV files (calendar, customers,
products, sales, stores). It is the kind of dataset that looks tidy on the surface but breaks the moment you try
to model it for analytics. Inconsistent IDs, missing reference rows, no surrogate keys, and zero separation
between source data and analytical structure.
Source data issues
Mixed casing and inconsistent text values across categorical columns
Missing product records that appeared in sales but not in products
No surrogate keys, only natural string IDs prone to drift
Numeric metrics stored as strings with currency symbols
No referential integrity enforced anywhere
Modelling gaps
No separation between raw, operational, and analytics layers
No engineered features, just raw values exposed to analysts
No date dimension or calendar attributes for time-based analysis
No customer segmentation or product tier classification
No outlier detection or margin bucketing for BI workloads
02
Architecture
Three-layer pipeline
The platform is built around a layered architecture that separates concerns at every
stage. Raw data is preserved untouched, the operational layer stores cleaned and standardised tables, and the
analytics layer is the dimensional warehouse used for reporting. Each layer has its own PostgreSQL schema and
its own loader module.
Step 01
Wipe
Reset folders and reset all 3 schemas
wipe_all.py
Step 02
Extract
Download dataset via Kaggle CLI, unzip, validate
extract.py
Step 03
Clean
Standardise, type cast, engineer features, write cleaned CSVs
clean.py
Step 04
Transform
Build dim & fact tables, validate FK integrity
transform.py
Step 05
Load Layers
Load raw → operational → analytics in order
load_*_schema.py
Step 06
Orchestrate
Single-command pipeline runner with structured logs
run_all.py
# Run the entire pipeline end-to-end python-metl.run_all
# What run_all.py executes, in order: 1.wipe_all→ drop folders + reset all 3
schemas 2.extract→ Kaggle CLI download, unzip,
log row counts 3.clean→ standardise, type cast,
engineer features 4.transform→ build dim & fact
tables, validate FKs 5.load_raw_schema→ raw_data schema
(source fidelity) 6.load_operationals_schema→
operationals schema (cleaned) 7.load_analytics_schema→ analytics
schema (dim + fact)
# Every step: timed, logged via custom ColorFormatter, fail-fast
03
Schema Design
Three schemas, one warehouse
Each schema has a clear, single responsibility. Raw is for source fidelity and audit.
Operational is for cleaned, standardised data with engineered features but no modelling. Analytics is the
dimensional warehouse, indexed and ready for BI workloads.
-- Star schema lineage (mostly star, one intentional snowflake)
dim_calendar─┐ dim_customers─┤ dim_products─┼──>fact_sales dim_stores─┤│ dim_regions─┘│ ↑ direct region_key FK for "revenue by region" queries
-- Five FK validations run in transform.py before fact load -- DB enforces FK RESTRICT + 7 CHECK constraints + 13 indexes
04
Feature Engineering
From raw values to analytics-ready features
The cleaning stage is not just type casting and null handling. Each table gets
domain-specific engineered features that make downstream BI work meaningful, not mechanical.
Customers
Tenure metrics: days, months, years from join date
Customer segment derived from tenure thresholds
Standardised gender and demographic fields
Surrogate customer_key for the dim table
Calendar
Renamed and re-parsed date columns for safe joins
Quarter, season, day type derived from date
Formatted date strings for display layers
Surrogate calendar_key for FK joins
Products
Standardised text fields and categorical values
Brand tier classification logic
Backfilled missing product records (P0000, P0201)
Surrogate product_key for the dim table
Stores
Country-to-region mapping (6 continental regions)
Region tier classification by market
Region promoted to its own dim during transform
Surrogate store_key + region_key FK
Sales
Coerced numeric metrics (revenue, cost, margin)
Revenue buckets and margin buckets
Outlier flag for unusually large transactions
Time of day categorisation from order timestamp
Cross-cutting
Foreign key validation before fact construction
Indexes on dim natural keys and fact FKs
Indexes on analytical filters (revenue_bucket, time_of_day)
FK constraints enforced at the DDL level
05
Key Decisions
Design choices & tradeoffs
Three explicit schemas
Raw, operational, and analytics each have their own schema rather than collapsing
into one. This makes lineage obvious, makes auditing trivial, and lets each layer be wiped or rebuilt
independently. Slightly more storage cost, much clearer mental model.
Wipe-then-load
Each pipeline run drops and recreates all schemas before loading. No upsert logic,
no partial-load bugs, fully deterministic output. Trade-off: not suitable for very large datasets, but ideal
for project-scale reproducibility and CI testing.
Positional surrogate keys
Every dim gets its surrogate key from
df.sort_values(natural_id).reset_index(drop=True).index + 1. Deterministic for the same input,
simple, fast. Trade-off: keys shift if new rows are added, so this only works because the warehouse is
rebuilt from scratch every run (wipe-and-load). For a warehouse that merged data over time, content-based
hashing (e.g. SHA-256 of natural keys) would be the right call instead.
Feature engineering in cleaning
Engineered features (tenure, segments, buckets, time of day) are computed in the
cleaning stage, not in BI tools. Analysts get clean, semantically meaningful columns from day one, and the
logic lives in version-controlled Python rather than SQL workbooks.
Modular loaders
Each schema has its own loader module (load_raw_schema, load_operationals_schema,
load_analytics_schema). Each one can be run independently for debugging, and the orchestrator just composes
them. No god-script.
FK validation before fact load
Before loading fact_sales, the transform stage runs five separate
validate_foreign_keys() checks: sales→customers, sales→products,
sales→stores, sales→regions, sales→calendar. If any fails, the
pipeline halts with sys.exit(1) and a structured error. Catches referential integrity issues
before they reach the warehouse.
DB constraints, not just types
The analytics schema (sql/analytics_schema.sql) defines five
FOREIGN KEY ... ON DELETE RESTRICT constraints, seven CHECK
constraints (age ≥ 0, quantity ≥ 0, order_date ≤ CURRENT_DATE, prices ≥ 0), UNIQUE
indexes on every natural ID, and 13 indexes across the fact and dims. PostgreSQL
enforces the integrity, not Python. Try to delete a dimension row that's still referenced by a fact and
Postgres refuses. The warehouse defends itself.
Region denormalised for query speed
The schema is mostly a pure star, with one intentional snowflake:
fact_sales has a direct region_key FK to dim_regions, AND
dim_stores also has a region_key FK to the same dim. The most common analytical
query against this warehouse is "revenue by region", and the direct fact-to-region path saves an extra join.
The full normalised path via dim_stores is still there for queries that need other store
attributes. Documented snowflake, not accidental snowflake.
Production-grade logging
The logging system is a deliberate engineering layer, not an afterthought.
Cross-platform UTF-8 detection (Windows Terminal, VS Code, PowerShell 7+, macOS/Linux) decides whether the
console can render emojis; dual handlers route colorized output to the terminal and clean UTF-8 to a
rotating file (no ANSI codes in logs); a section() helper visually separates ETL stages; a
timed() decorator logs how long each stage takes; SQLAlchemy engine noise is suppressed.
Result: observability that holds up at 2am, not just on the demo screen.
06
Outcomes
What the platform enables
3 schemas
Clean separation between raw, operational, and analytics layers
10+ feats
Engineered features available to BI on day one (segments, buckets, tiers)
1 cmd
Full pipeline runs end-to-end with python -m etl.run_all
5 tbls
Analytics-ready dim + fact tables, FK-validated and indexed
What this warehouse can answer
The dimensional
model directly supports the kind of questions a sales or growth team would ask of a chocolate-retail dataset.
Each one is a single SQL query against the analytics schema, not a multi-step pivot in a spreadsheet.
Revenue analysis
Revenue by product, store, or region
Margin analysis with bucket segmentation
Outlier flagging for unusually large transactions
Customer behaviour
Customer segmentation by tenure
Cohort comparison across joining periods
Repeat-buyer identification via fact_sales
Time & seasonality
Quarter and season-based performance
Day-type and time-of-day sales patterns
Year-over-year trend analysis via dim_calendar
07
Tech Stack
The tools, and what each one does here.
Python 3.xETL + transforms orchestration
pandasCleaning + feature engineering
PostgreSQL 18Three-schema warehouse
SQLAlchemyDB engine + FK validation
psycopg2Postgres driver
SQL (DDL)Layered schema design
python-dotenvEnv-var config loader
Custom loggerColorFormatter + structured logs
Kaggle CLISource data download
If I were scaling this up
For production scale the natural evolutions would be: dbt on top of the analytics schema for
analyst-authored transformations with tested SQL models; Airflow or Prefect orchestration for
scheduled refreshes with retry and alerting; materialised views for the most-queried analytics
aggregates; SCD Type 2 dimensions with valid_from / valid_to tracking
for time-aware product and category history; incremental loads against the operational layer
instead of full refreshes; and migration tooling (Alembic, Flyway) for evolving the
three-schema DDL safely.
Let's talk
Need a data
engineer who can layer a warehouse properly?
I'm currently looking for Data Engineer / Analytics Engineer roles. UK-based, open to remote or hybrid.