These are not borrowed from blog posts or framework documentation. They are positions I have taken often enough that they feel less like opinions and more like defaults. Each one comes with a justification, and where it matters, the trade-off I accept by holding it.
I define warehouse tables in explicit SQL DDL files, not generated from pandas dtypes or ORM models. A schema written in version-controlled SQL is auditable in pull requests, readable by analysts who do not write Python, and portable across environments without a runtime.
The temptation to let Python infer types and create tables on the fly is real, especially early in a project. It feels faster. It is also the moment a warehouse stops being a contract and starts being whatever Python decided last Tuesday.
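A minimal sketch of what this can look like, assuming the DDL lives in a directory of numbered `.sql` files and using SQLite from the standard library as a stand-in for the warehouse. The function name `apply_ddl` and the file naming scheme are illustrative assumptions, not a prescribed layout. Python only executes the files; it never decides a type.

```python
import sqlite3
from pathlib import Path


def apply_ddl(conn: sqlite3.Connection, ddl_dir: Path) -> list:
    """Run every .sql file in ddl_dir in filename order.

    Returns the file names that were applied, so the caller can log them.
    The schema itself is defined only in the SQL files.
    """
    applied = []
    # Sorting by name means 001_dim_customer.sql runs before 002_fact_orders.sql.
    for sql_file in sorted(ddl_dir.glob("*.sql")):
        conn.executescript(sql_file.read_text(encoding="utf-8"))
        applied.append(sql_file.name)
    return applied
```

Because the files are plain SQL, the same directory can be reviewed in a pull request or run by hand in a database client without Python installed.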
A pipeline that produces different output on rerun is a pipeline that produces production incidents. I default to wipe-and-reload over upsert logic until scale forces otherwise, because deterministic output is worth the storage cost.
Partial-load bugs are the worst kind to debug. They look fine until someone notices a metric drifted by 3% over a week. Idempotent pipelines fail loudly when something is wrong. Non-idempotent pipelines fail quietly, in the data, where you only find them after the dashboard has been showing the wrong number for a month.
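As a rough illustration of wipe-and-reload, here is a sketch using SQLite from the standard library. The table `fact_orders` and its two columns are hypothetical. The delete and the insert share one transaction, so a failed load rolls back to the previous contents instead of leaving a partial table.

```python
import sqlite3


def reload_fact_orders(conn: sqlite3.Connection, rows) -> None:
    """Replace the full contents of fact_orders with rows, atomically.

    fact_orders(order_id, amount) is a hypothetical table for this sketch.
    """
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so the table holds either the old data or the new data, never a mix.
    with conn:
        conn.execute("DELETE FROM fact_orders")
        conn.executemany(
            "INSERT INTO fact_orders (order_id, amount) VALUES (?, ?)",
            rows,
        )
```

Running the function twice with the same input leaves the table in the same state, which is the property the rerun argument depends on.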
Every pipeline stage gets a timing decorator and a structured log line at completion. Row counts in, row counts out, elapsed time, validation summaries. The cost of writing one extra log line is microseconds. The cost of not having it during an incident is hours of guessing.
Good logs read like a flight recorder. They tell you exactly what happened, in order, with enough context that you do not need to rerun anything to reconstruct the incident. Bad logs say "something went wrong" and leave you opening the database to check the row counts yourself.
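One possible shape for the timing decorator, as a sketch: it assumes each stage takes a list of rows and returns a list of rows, and it emits one JSON log line per completed stage. The logger name `pipeline` and the field names are illustrative choices.

```python
import functools
import json
import logging
import time

logger = logging.getLogger("pipeline")  # hypothetical logger name


def timed_stage(func):
    """Log stage name, rows in, rows out, and elapsed time when a stage completes."""

    @functools.wraps(func)
    def wrapper(rows):
        start = time.perf_counter()
        result = func(rows)
        logger.info(json.dumps({
            "stage": func.__name__,
            "rows_in": len(rows),
            "rows_out": len(result),
            "elapsed_s": round(time.perf_counter() - start, 4),
        }))
        return result

    return wrapper


@timed_stage
def clean(rows):
    """Example stage: drop rows that have no amount."""
    return [r for r in rows if r.get("amount") is not None]
```

A structured line like this can be filtered by stage and compared across runs, which is what makes the flight-recorder reading possible.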
Schema drift, null values in critical columns, foreign key mismatches. Checks for these belong in a validation stage that runs before any data reaches the warehouse, not in a SELECT query you remember to run after a load looks suspicious.
I add validation at the extract, clean, and transform stages. Row counts are logged. Null counts are logged. Foreign key references are checked against dimension tables before the fact table is built. The cost is a few seconds of pipeline time. The benefit is that bad data never gets the chance to corrupt a downstream table.
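A sketch of such a validation step, assuming rows are plain dicts and that the set of valid dimension keys has already been loaded. The column names `order_id` and `customer_id` and both function names are hypothetical.

```python
def validate_facts(fact_rows, dim_customer_ids, required=("order_id", "customer_id")):
    """Summarise row count, nulls in required columns, and orphaned foreign keys."""
    null_counts = {
        col: sum(1 for row in fact_rows if row.get(col) is None)
        for col in required
    }
    # An orphan is a non-null customer_id with no matching dimension row.
    orphan_count = sum(
        1
        for row in fact_rows
        if row.get("customer_id") is not None
        and row["customer_id"] not in dim_customer_ids
    )
    return {
        "row_count": len(fact_rows),
        "null_counts": null_counts,
        "orphan_count": orphan_count,
    }


def assert_valid(summary):
    """Raise before the load if any required value is null or any key is orphaned."""
    problems = [f"{col} has {n} nulls" for col, n in summary["null_counts"].items() if n]
    if summary["orphan_count"]:
        problems.append(f"{summary['orphan_count']} rows reference a missing customer")
    if problems:
        raise ValueError("validation failed: " + "; ".join(problems))
```

The summary dict can go straight into the stage's log line, and `assert_valid` stops the pipeline before the fact table is built.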
Database URLs, credentials, source paths, environment flags. None of these belong as string literals in Python files, and none belong in a central config module that gets committed to the repo. They belong in a .env file, loaded at startup, never logged, and never committed.
This is not a security feature, although it is also that. It is a portability feature. A pipeline that reads its config from environment variables runs on my laptop, in CI, in staging, and in production with no code changes. A pipeline with hardcoded values runs in exactly one place and breaks in every other.
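A minimal sketch of reading configuration from the environment, using only the standard library. It assumes the `.env` file has already been loaded into the process environment by whatever loader or shell the project uses. The variable names (`PIPELINE_DATABASE_URL` and so on) are made up for illustration.

```python
import os
from dataclasses import dataclass, field

# Hypothetical variable names; a real project would define its own.
REQUIRED = ("PIPELINE_DATABASE_URL", "PIPELINE_SOURCE_PATH")


@dataclass(frozen=True)
class Settings:
    # repr=False keeps the credential-bearing URL out of logs and tracebacks.
    database_url: str = field(repr=False)
    source_path: str
    environment: str


def load_settings(env=None) -> Settings:
    """Build Settings from environment variables, failing fast if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required environment variables: " + ", ".join(missing))
    return Settings(
        database_url=env["PIPELINE_DATABASE_URL"],
        source_path=env["PIPELINE_SOURCE_PATH"],
        environment=env.get("PIPELINE_ENV", "development"),
    )
```

Passing a plain dict as `env` makes the loader easy to test, and the same code runs unchanged on a laptop, in CI, and in production.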
Each stage of a pipeline (extract, clean, transform, load) is its own module with a single responsibility. The orchestrator composes them. This means I can run any stage independently for debugging, replace any stage without touching the others, and reason about each stage in isolation.
The opposite pattern (a single 800-line script that does everything) is faster to write and impossible to maintain. I have inherited those scripts. I will never write one.
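To make the composition concrete, here is a compressed sketch. In a real project each stage would live in its own module; they are plain functions in one file here only so the example is self-contained. All names and the row shape are illustrative.

```python
def extract(source_rows):
    """Stage 1: copy raw rows out of the source."""
    return [dict(row) for row in source_rows]


def clean(rows):
    """Stage 2: drop rows with no amount."""
    return [row for row in rows if row.get("amount") is not None]


def transform(rows):
    """Stage 3: cast amount from text to a number."""
    return [{**row, "amount": float(row["amount"])} for row in rows]


def load(rows, target):
    """Stage 4: append rows to the target; a list stands in for the warehouse."""
    target.extend(rows)
    return len(rows)


def run_pipeline(source_rows, target):
    """Orchestrator: composes the stages and owns no business logic itself."""
    extracted = extract(source_rows)
    cleaned = clean(extracted)
    transformed = transform(cleaned)
    loaded = load(transformed, target)
    return {"extracted": len(extracted), "cleaned": len(cleaned), "loaded": loaded}
```

Because every stage takes rows and returns rows, any one of them can be called on its own with a small fixture when debugging.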
Customer segments, revenue buckets, product tiers, time-of-day categories. These belong in the analytics schema as columns on the dim or fact table, computed once during the transform stage. Not as CASE statements in BI tools, not as calculated fields in Tableau, not as SQL written fresh in every dashboard.
When the business logic for a segment changes, I want to update it in one Python file and have every dashboard reflect the change automatically. I do not want to chase down 12 copies of the same CASE statement scattered across BI tools.
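As an example of what "one Python file" can mean, here is a sketch of a revenue bucket computed during the transform stage. The thresholds, labels, and the `lifetime_revenue` column are invented for illustration.

```python
# Hypothetical thresholds, ordered from highest to lowest.
REVENUE_BUCKETS = ((1000, "high"), (100, "mid"))


def revenue_bucket(amount):
    """Return the label of the first bucket whose threshold the amount meets."""
    for threshold, label in REVENUE_BUCKETS:
        if amount >= threshold:
            return label
    return "low"


def add_segments(rows):
    """Transform step: attach revenue_bucket as a column on each customer row."""
    return [
        {**row, "revenue_bucket": revenue_bucket(row["lifetime_revenue"])}
        for row in rows
    ]
```

Changing a threshold in `REVENUE_BUCKETS` and rerunning the pipeline updates the column for every consumer, since dashboards read the stored value instead of recomputing it.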
All four case studies on this site demonstrate these principles applied to production code, with the featured FibbieBanks project showing every one of the seven in action.