A production-style web scraping pipeline that scrapes 60 pages of AliExpress laptop listings,
validates and enriches the data with discount metrics and price bands, and appends only new records to a
PostgreSQL warehouse. Modular, idempotent at the load layer, observable, and ready to rerun.
Live web scrapingSelenium + BeautifulSoup60 pages per runPostgreSQL warehouseAppend-only loads
E-commerce pages are not built to be scraped. They lazy-load content, show cookie banners
on every visit, detect automated browsers, and change their HTML layout without warning. The data you eventually
pull is raw text full of inconsistencies. A pipeline that works once is easy. A pipeline that works repeatedly,
against a live site, is the harder problem.
Scraping challenges
Lazy-loaded listings that need explicit waits
Cookie banners blocking content on every visit
Bot detection flags on default browser automation
Pagination behaviour that varies between page loads
Inconsistent HTML structure across listing variants
Data quality challenges
Prices stored as strings with currency symbols
Missing original prices on non-discounted items
Duplicate listings across consecutive page loads
No surrogate keys, only listing text as identity
Mixed casing and inconsistent shipping labels
02
Architecture
Four-stage pipeline
The pipeline runs end to end with a single command. Each stage is an isolated module that
can also run independently. A DataFrame is passed in memory between extract, transform, and load (no
intermediate file unless requested), keeping the pipeline fast and the failure surface predictable.
Step 01
Health Check
HTTP + browser test before scraping
scraper.py
Step 02
Extract
Scrape 60 pages, parse HTML, return DataFrame
extract.py
Step 03
Transform
Clean prices, dedupe, engineer features, validate
transform.py
Step 04
Load
Dedupe vs existing, append only new rows
load.py
Step 05
Orchestrate
Single command, in-memory df passing, timed
run_all.py
# Run the full pipeline end-to-end python-metl.run_all
# What run_all.py executes: 1.scraper.health_check()→ HTTP request
+ real browser test 2.extract.scrape()→ 60 pages, returns
DataFrame 3.transform.clean(df)→ type cast,
dedupe, engineer features 4.load.append(df)→ dedupe vs DB, append
only new rows
# Every step: timed, logged via custom ColorFormatter, fail-fast
03
Scraping Layer
Reliable scraping, hostile target
The extraction layer is the most fragile part of any web scraping pipeline. I treated it
as a system with explicit reliability requirements, not a script to be rerun until it works.
Health check first
Before scraping, scraper.py sends an HTTP request to AliExpress and opens a real Chrome
browser. If either fails, the pipeline aborts with a clear error instead of producing garbage output.
Bot detection avoidance
Selenium is configured to disable common automation flags. The browser opens looking like a real user
agent, not a default Chromedriver instance.
Cookie banner auto-accept
The scraper waits for the cookie banner to render, dismisses it, then continues. No human intervention, no
scraped HTML hidden behind a modal.
Explicit waits, not sleep
Every page transition uses WebDriverWait against a specific element selector. No
time.sleep() guesses. The scraper waits exactly as long as needed and no longer.
BeautifulSoup for parsing
Selenium handles navigation. BeautifulSoup handles parsing. Each tool does what it is best at, and the
parsing logic is unit-testable without spinning up a browser.
Pagination across 60 pages
The scraper walks through all 60 pages of laptop listings in a single run, with retry logic at each page
transition. Failure on page 42 does not corrupt the data from pages 1 through 41.
04
Feature Engineering
From raw scrape to analytics-ready
The transform stage does more than clean data. It engineers business-friendly features
that turn raw scraped text into something a BI tool or SQL query can answer questions against.
Source
columns scraped
Column
Description
product_name
Full listing title
product_price
Current sale price (£)
was_price
Original crossed-out price (£)
discount_info
Discount label from the listing
extra_discount
Additional coupon or offer
shipping_status
Free shipping or not
total_sold_count
Number of units sold
top_selling_status
Top selling on AliExpress flag
scraped_at
Timestamp of scrape
Features engineered in transform
Column
Logic
discount_amount
was_price − product_price in £
discount_percentage
% saved off the original price
is_on_sale
Boolean. Was-price exists and is higher than current price
is_free_shipping
Boolean. Free shipping offered on the listing
is_top_seller
Boolean. Top-selling badge present on the listing
price_band
Budget / Mid-Range / Upper-Mid / Premium / Luxury
-- Example queries the warehouse can answer:
-- Which laptops offer the best discount? SELECTproduct_name, product_price, discount_percentage FROMlaptops WHEREis_on_sale =TRUE ORDER BYdiscount_percentageDESCLIMIT10;
-- What price band has the most listings? SELECTprice_band,COUNT(*) AStotal FROMlaptops GROUP BYprice_band ORDER BYtotalDESC;
05
Key Decisions
Design choices & tradeoffs
Selenium + BeautifulSoup
Selenium handles dynamic JavaScript rendering and navigation. BeautifulSoup handles
parsing the resulting HTML. Each tool stays in its lane. The parsing layer is unit-testable without
launching a browser, which is the difference between an hour of debugging and ten seconds of pytest.
Append-only load
Unlike PayFlow and ChocoDelight, which wipe and reload, this pipeline appends.
Scraping a live source is non-idempotent at the source layer (data changes between runs), so the load step
deduplicates against existing rows before inserting. New listings get added, old data stays untouched, no
row is ever duplicated.
In-memory df passing
Extract, transform, and load pass a DataFrame directly between stages instead of
always writing CSVs to disk between steps. This keeps the orchestrated pipeline fast. Each step can still be
run independently and writes its CSV output if you need to inspect intermediate state.
Explicit waits, never sleep
time.sleep() is a guess. WebDriverWait is a
contract. Every page interaction waits for a specific element to be present or interactable, not for a
hardcoded number of seconds. Scraper is faster on fast networks, more reliable on slow ones, and never flaky
for reasons that should not matter.
Health check before scrape
Before extract.py runs, scraper.py confirms AliExpress is reachable via HTTP and
Chrome can launch successfully. If either fails, the pipeline aborts before scraping a single page. Saves
time on broken environments and surfaces config issues clearly.
Feature engineering in Python
Discount percentage, price bands, and boolean flags are computed in the transform
stage in pandas, not as derived columns in SQL views. Logic lives in version-controlled Python where it can
be tested, reviewed, and changed in one place. Downstream analytics see clean, semantically meaningful
columns.
06
Outcomes
What this pipeline enables
60 pages
Scraped end-to-end in a single run with retry logic at every page transition
6 features
Engineered in the transform stage (discount %, price bands, three boolean flags)
1 cmd
Full pipeline runs end-to-end with python -m etl.run_all
0 dupes
Append-only loads with pre-insert deduplication. Reruns never corrupt the warehouse
What this warehouse can answer
Each engineered
feature directly enables a category of analytical question. Below are real business questions a
price-intelligence team or competitive-analysis team would ask of this data, all answerable with a single SQL
query.
Pricing intelligence
Best-discount laptops right now
Average price by price band
Discount distribution across the catalog
Listing patterns
Share of listings on sale at any time
Free-shipping vs paid-shipping price differential
Top-seller flag correlation with discount depth
Trend tracking
Price changes over time per listing
Listing churn (new vs returning items)
Long-running discounts vs flash sales
07
Tech Stack
The tools, and what each one does here.
Python 3.10+ETL + scraping orchestration
Selenium WebDriverBrowser automation + pagination
BeautifulSoupHTML parsing + element extraction
pandasCleaning + feature engineering
PostgreSQL 18Append-only warehouse
SQLAlchemyDB engine + bulk loads
psycopg2Postgres driver
Custom loggerCross-platform ColorFormatter
python-dotenvEnv-var config loader
If I were scaling this up
For higher volumes and reliability the natural evolutions would be: Playwright instead of
Selenium for faster, more resilient browser automation; proxy rotation to avoid rate-limiting
on long-running scrapes; asynchronous fetching (asyncio + aiohttp) for static endpoints to
parallelise page retrieval; Airflow or Prefect orchestration for scheduled runs with retry and
alerting; change-detection logic to flag price drops or stock changes per listing rather than
just appending new rows; and migration tooling (Alembic, Flyway) for evolving the schema safely
as scraped attributes change.
Let's talk
Need a data
engineer who can build reliable scraping pipelines?
I'm currently looking for Data Engineer / Analytics Engineer roles. UK-based, open to remote or hybrid.