Back to portfolio
~6 min read · Updated May 2026
Case Study · Live Web Scraping & ETL
Live data source

AliExpress Laptop ETL Pipeline
& PostgreSQL warehouse

A production-style web scraping pipeline that scrapes 60 pages of AliExpress laptop listings, validates and enriches the data with discount metrics and price bands, and appends only new records to a PostgreSQL warehouse. Modular, idempotent at the load layer, observable, and ready to rerun.

Live web scraping Selenium + BeautifulSoup 60 pages per run PostgreSQL warehouse Append-only loads
60
pages scraped
9
source columns
6
engineered features
4
ETL stages
1
command to run
01
The Problem

Live, hostile, unstructured

E-commerce pages are not built to be scraped. They lazy-load content, show cookie banners on every visit, detect automated browsers, and change their HTML layout without warning. The data you eventually pull is raw text full of inconsistencies. A pipeline that works once is easy. A pipeline that works repeatedly, against a live site, is the harder problem.

Scraping challenges

  • Lazy-loaded listings that need explicit waits
  • Cookie banners blocking content on every visit
  • Bot detection flags on default browser automation
  • Pagination behaviour that varies between page loads
  • Inconsistent HTML structure across listing variants

Data quality challenges

  • Prices stored as strings with currency symbols
  • Missing original prices on non-discounted items
  • Duplicate listings across consecutive page loads
  • No surrogate keys, only listing text as identity
  • Mixed casing and inconsistent shipping labels
02
Architecture

Four-stage pipeline

The pipeline runs end to end with a single command. Each stage is an isolated module that can also run independently. A DataFrame is passed in memory between extract, transform, and load (no intermediate file unless requested), keeping the pipeline fast and the failure surface predictable.

Step 01
Health Check
HTTP + browser test before scraping
scraper.py
Step 02
Extract
Scrape 60 pages, parse HTML, return DataFrame
extract.py
Step 03
Transform
Clean prices, dedupe, engineer features, validate
transform.py
Step 04
Load
Dedupe vs existing, append only new rows
load.py
Step 05
Orchestrate
Single command, in-memory df passing, timed
run_all.py
# Run the full pipeline end-to-end
python -m etl.run_all

# What run_all.py executes:
1. scraper.health_check() → HTTP request + real browser test
2. extract.scrape() → 60 pages, returns DataFrame
3. transform.clean(df) → type cast, dedupe, engineer features
4. load.append(df) → dedupe vs DB, append only new rows

# Every step: timed, logged via custom ColorFormatter, fail-fast
03
Scraping Layer

Reliable scraping, hostile target

The extraction layer is the most fragile part of any web scraping pipeline. I treated it as a system with explicit reliability requirements, not a script to be rerun until it works.

Health check first

Before scraping, scraper.py sends an HTTP request to AliExpress and opens a real Chrome browser. If either fails, the pipeline aborts with a clear error instead of producing garbage output.

Bot detection avoidance

Selenium is configured to disable common automation flags. The browser opens looking like a real user agent, not a default Chromedriver instance.

Cookie banner auto-accept

The scraper waits for the cookie banner to render, dismisses it, then continues. No human intervention, no scraped HTML hidden behind a modal.

Explicit waits, not sleep

Every page transition uses WebDriverWait against a specific element selector. No time.sleep() guesses. The scraper waits exactly as long as needed and no longer.

BeautifulSoup for parsing

Selenium handles navigation. BeautifulSoup handles parsing. Each tool does what it is best at, and the parsing logic is unit-testable without spinning up a browser.

Pagination across 60 pages

The scraper walks through all 60 pages of laptop listings in a single run, with retry logic at each page transition. Failure on page 42 does not corrupt the data from pages 1 through 41.

04
Feature Engineering

From raw scrape to analytics-ready

The transform stage does more than clean data. It engineers business-friendly features that turn raw scraped text into something a BI tool or SQL query can answer questions against.

Source columns scraped

Column Description
product_name Full listing title
product_price Current sale price (£)
was_price Original crossed-out price (£)
discount_info Discount label from the listing
extra_discount Additional coupon or offer
shipping_status Free shipping or not
total_sold_count Number of units sold
top_selling_status Top selling on AliExpress flag
scraped_at Timestamp of scrape

Features engineered in transform

Column Logic
discount_amount was_price − product_price in £
discount_percentage % saved off the original price
is_on_sale Boolean. Was-price exists and is higher than current price
is_free_shipping Boolean. Free shipping offered on the listing
is_top_seller Boolean. Top-selling badge present on the listing
price_band Budget / Mid-Range / Upper-Mid / Premium / Luxury
-- Example queries the warehouse can answer:

-- Which laptops offer the best discount?
SELECT product_name, product_price, discount_percentage
FROM laptops
WHERE is_on_sale = TRUE
ORDER BY discount_percentage DESC LIMIT 10;

-- What price band has the most listings?
SELECT price_band, COUNT(*) AS total
FROM laptops
GROUP BY price_band
ORDER BY total DESC;
05
Key Decisions

Design choices & tradeoffs

Selenium + BeautifulSoup
Selenium handles dynamic JavaScript rendering and navigation. BeautifulSoup handles parsing the resulting HTML. Each tool stays in its lane. The parsing layer is unit-testable without launching a browser, which is the difference between an hour of debugging and ten seconds of pytest.
Append-only load
Unlike PayFlow and ChocoDelight, which wipe and reload, this pipeline appends. Scraping a live source is non-idempotent at the source layer (data changes between runs), so the load step deduplicates against existing rows before inserting. New listings get added, old data stays untouched, no row is ever duplicated.
In-memory df passing
Extract, transform, and load pass a DataFrame directly between stages instead of always writing CSVs to disk between steps. This keeps the orchestrated pipeline fast. Each step can still be run independently and writes its CSV output if you need to inspect intermediate state.
Explicit waits, never sleep
time.sleep() is a guess. WebDriverWait is a contract. Every page interaction waits for a specific element to be present or interactable, not for a hardcoded number of seconds. Scraper is faster on fast networks, more reliable on slow ones, and never flaky for reasons that should not matter.
Health check before scrape
Before extract.py runs, scraper.py confirms AliExpress is reachable via HTTP and Chrome can launch successfully. If either fails, the pipeline aborts before scraping a single page. Saves time on broken environments and surfaces config issues clearly.
Feature engineering in Python
Discount percentage, price bands, and boolean flags are computed in the transform stage in pandas, not as derived columns in SQL views. Logic lives in version-controlled Python where it can be tested, reviewed, and changed in one place. Downstream analytics see clean, semantically meaningful columns.
06
Outcomes

What this pipeline enables

60 pages
Scraped end-to-end in a single run with retry logic at every page transition
6 features
Engineered in the transform stage (discount %, price bands, three boolean flags)
1 cmd
Full pipeline runs end-to-end with python -m etl.run_all
0 dupes
Append-only loads with pre-insert deduplication. Reruns never corrupt the warehouse

What this warehouse can answer

Each engineered feature directly enables a category of analytical question. Below are real business questions a price-intelligence team or competitive-analysis team would ask of this data, all answerable with a single SQL query.

Pricing intelligence

  • Best-discount laptops right now
  • Average price by price band
  • Discount distribution across the catalog

Listing patterns

  • Share of listings on sale at any time
  • Free-shipping vs paid-shipping price differential
  • Top-seller flag correlation with discount depth

Trend tracking

  • Price changes over time per listing
  • Listing churn (new vs returning items)
  • Long-running discounts vs flash sales
07
Tech Stack

The tools, and what each one does here.

Python 3.10+ ETL + scraping orchestration
Selenium WebDriver Browser automation + pagination
BeautifulSoup HTML parsing + element extraction
pandas Cleaning + feature engineering
PostgreSQL 18 Append-only warehouse
SQLAlchemy DB engine + bulk loads
psycopg2 Postgres driver
Custom logger Cross-platform ColorFormatter
python-dotenv Env-var config loader

If I were scaling this up

For higher volumes and reliability the natural evolutions would be: Playwright instead of Selenium for faster, more resilient browser automation; proxy rotation to avoid rate-limiting on long-running scrapes; asynchronous fetching (asyncio + aiohttp) for static endpoints to parallelise page retrieval; Airflow or Prefect orchestration for scheduled runs with retry and alerting; change-detection logic to flag price drops or stock changes per listing rather than just appending new rows; and migration tooling (Alembic, Flyway) for evolving the schema safely as scraped attributes change.

Let's talk

Need a data engineer who can build reliable scraping pipelines?

I'm currently looking for Data Engineer / Analytics Engineer roles. UK-based, open to remote or hybrid.