~6 min read · Updated May 2026

Case Study · Live Web Scraping & ETL

Live data source

AliExpress Laptop ETL Pipeline
& PostgreSQL warehouse

A production-style web scraping pipeline that scrapes 60 pages of AliExpress laptop listings, validates and enriches the data with discount metrics and price bands, and appends only new records to a PostgreSQL warehouse. Modular, idempotent at the load layer, observable, and ready to rerun.

Live web scraping Selenium + BeautifulSoup 60 pages per run PostgreSQL warehouse Append-only loads

pages scraped

source columns

engineered features

ETL stages

command to run

View on GitHub Read the story

The Problem

Live, hostile, unstructured

E-commerce pages are not built to be scraped. They lazy-load content, show cookie banners on every visit, detect automated browsers, and change their HTML layout without warning. The data you eventually pull is raw text full of inconsistencies. A pipeline that works once is easy. A pipeline that works repeatedly, against a live site, is the harder problem.

Scraping challenges

Lazy-loaded listings that need explicit waits
Cookie banners blocking content on every visit
Bot detection flags on default browser automation
Pagination behaviour that varies between page loads
Inconsistent HTML structure across listing variants

Data quality challenges

Prices stored as strings with currency symbols
Missing original prices on non-discounted items
Duplicate listings across consecutive page loads
No surrogate keys, only listing text as identity
Mixed casing and inconsistent shipping labels

Architecture

Four-stage pipeline

The pipeline runs end to end with a single command. Each stage is an isolated module that can also run independently. A DataFrame is passed in memory between extract, transform, and load (no intermediate file unless requested), keeping the pipeline fast and the failure surface predictable.

Step 01

Health Check

HTTP + browser test before scraping

scraper.py

Step 02

Extract

Scrape 60 pages, parse HTML, return DataFrame

extract.py

Step 03

Transform

Clean prices, dedupe, engineer features, validate

transform.py

Step 04

Load

Dedupe vs existing, append only new rows

load.py

Step 05

Orchestrate

Single command, in-memory df passing, timed

run_all.py

# Run the full pipeline end-to-end

python -m etl.run_all

# What run_all.py executes:

1. scraper.health_check()     →  HTTP request + real browser test

2. extract.scrape()            →  60 pages, returns DataFrame

3. transform.clean(df)          →  type cast, dedupe, engineer features

4. load.append(df)              →  dedupe vs DB, append only new rows

# Every step: timed, logged via custom ColorFormatter, fail-fast

Scraping Layer

Reliable scraping, hostile target

The extraction layer is the most fragile part of any web scraping pipeline. I treated it as a system with explicit reliability requirements, not a script to be rerun until it works.

Health check first

Before scraping, scraper.py sends an HTTP request to AliExpress and opens a real Chrome browser. If either fails, the pipeline aborts with a clear error instead of producing garbage output.

Bot detection avoidance

Selenium is configured to disable common automation flags. The browser opens looking like a real user agent, not a default Chromedriver instance.

Cookie banner auto-accept

The scraper waits for the cookie banner to render, dismisses it, then continues. No human intervention, no scraped HTML hidden behind a modal.

Explicit waits, not sleep

Every page transition uses WebDriverWait against a specific element selector. No time.sleep() guesses. The scraper waits exactly as long as needed and no longer.

BeautifulSoup for parsing

Selenium handles navigation. BeautifulSoup handles parsing. Each tool does what it is best at, and the parsing logic is unit-testable without spinning up a browser.

Pagination across 60 pages

The scraper walks through all 60 pages of laptop listings in a single run, with retry logic at each page transition. Failure on page 42 does not corrupt the data from pages 1 through 41.

Feature Engineering

From raw scrape to analytics-ready

The transform stage does more than clean data. It engineers business-friendly features that turn raw scraped text into something a BI tool or SQL query can answer questions against.

Source columns scraped

Column	Description
product_name	Full listing title
product_price	Current sale price (£)
was_price	Original crossed-out price (£)
discount_info	Discount label from the listing
extra_discount	Additional coupon or offer
shipping_status	Free shipping or not
total_sold_count	Number of units sold
top_selling_status	Top selling on AliExpress flag
scraped_at	Timestamp of scrape

Features engineered in transform

Column	Logic
discount_amount	was_price − product_price in £
discount_percentage	% saved off the original price
is_on_sale	Boolean. Was-price exists and is higher than current price
is_free_shipping	Boolean. Free shipping offered on the listing
is_top_seller	Boolean. Top-selling badge present on the listing
price_band	Budget / Mid-Range / Upper-Mid / Premium / Luxury

-- Example queries the warehouse can answer:

-- Which laptops offer the best discount?

SELECT product_name, product_price, discount_percentage

FROM laptops

WHERE is_on_sale = TRUE

ORDER BY discount_percentage DESC LIMIT 10;

-- What price band has the most listings?

SELECT price_band, COUNT(*) AS total

FROM laptops

GROUP BY price_band

ORDER BY total DESC;

Key Decisions

Design choices & tradeoffs

Selenium + BeautifulSoup

Selenium handles dynamic JavaScript rendering and navigation. BeautifulSoup handles parsing the resulting HTML. Each tool stays in its lane. The parsing layer is unit-testable without launching a browser, which is the difference between an hour of debugging and ten seconds of pytest.

Append-only load

Unlike PayFlow and ChocoDelight, which wipe and reload, this pipeline appends. Scraping a live source is non-idempotent at the source layer (data changes between runs), so the load step deduplicates against existing rows before inserting. New listings get added, old data stays untouched, no row is ever duplicated.

In-memory df passing

Extract, transform, and load pass a DataFrame directly between stages instead of always writing CSVs to disk between steps. This keeps the orchestrated pipeline fast. Each step can still be run independently and writes its CSV output if you need to inspect intermediate state.

Explicit waits, never sleep

time.sleep() is a guess. WebDriverWait is a contract. Every page interaction waits for a specific element to be present or interactable, not for a hardcoded number of seconds. Scraper is faster on fast networks, more reliable on slow ones, and never flaky for reasons that should not matter.

Health check before scrape

Before extract.py runs, scraper.py confirms AliExpress is reachable via HTTP and Chrome can launch successfully. If either fails, the pipeline aborts before scraping a single page. Saves time on broken environments and surfaces config issues clearly.

Feature engineering in Python

Discount percentage, price bands, and boolean flags are computed in the transform stage in pandas, not as derived columns in SQL views. Logic lives in version-controlled Python where it can be tested, reviewed, and changed in one place. Downstream analytics see clean, semantically meaningful columns.

Outcomes

What this pipeline enables

60 pages

Scraped end-to-end in a single run with retry logic at every page transition

6 features

Engineered in the transform stage (discount %, price bands, three boolean flags)

1 cmd

Full pipeline runs end-to-end with python -m etl.run_all

0 dupes

Append-only loads with pre-insert deduplication. Reruns never corrupt the warehouse

What this warehouse can answer

Each engineered feature directly enables a category of analytical question. Below are real business questions a price-intelligence team or competitive-analysis team would ask of this data, all answerable with a single SQL query.

Pricing intelligence

Best-discount laptops right now
Average price by price band
Discount distribution across the catalog

Listing patterns

Share of listings on sale at any time
Free-shipping vs paid-shipping price differential
Top-seller flag correlation with discount depth

Trend tracking

Price changes over time per listing
Listing churn (new vs returning items)
Long-running discounts vs flash sales

Tech Stack

The tools, and what each one does here.

Python 3.10+ ETL + scraping orchestration

Selenium WebDriver Browser automation + pagination

BeautifulSoup HTML parsing + element extraction

pandas Cleaning + feature engineering

PostgreSQL 18 Append-only warehouse

SQLAlchemy DB engine + bulk loads

psycopg2 Postgres driver

Custom logger Cross-platform ColorFormatter

python-dotenv Env-var config loader

If I were scaling this up

For higher volumes and reliability the natural evolutions would be: Playwright instead of Selenium for faster, more resilient browser automation; proxy rotation to avoid rate-limiting on long-running scrapes; asynchronous fetching (asyncio + aiohttp) for static endpoints to parallelise page retrieval; Airflow or Prefect orchestration for scheduled runs with retry and alerting; change-detection logic to flag price drops or stock changes per listing rather than just appending new rows; and migration tooling (Alembic, Flyway) for evolving the schema safely as scraped attributes change.

Let's talk

Need a data engineer who can build reliable scraping pipelines?

I'm currently looking for Data Engineer / Analytics Engineer roles. UK-based, open to remote or hybrid.

View this project on GitHub Get in touch

Other case studies

Case study · Featured · Distributed ETL

FibbieBanks ETL & Warehouse

A 1M-row PySpark pipeline for synthetic banking transactions, with a validated star schema warehouse, deterministic SHA-256 surrogate keys, and idempotent re-runs on PostgreSQL.

PySpark PostgreSQL SQLAlchemy

Read the case study→

Case study · End-to-end ETL

PayFlow — ETL & Warehouse

A production-style ETL pipeline that loads 9 raw Brazilian e-commerce CSVs into a full PostgreSQL star schema, with idempotent loads and structured logging.

Python PostgreSQL SQLAlchemy