Multiple Clients · 2 months · 2022

Auto-Scheduled Web Scrapers

Impact: Automated data collection

Infrastructure: GKE

Scheduling: Cron Jobs

API: REST

Overview

Developed a scalable web scraping infrastructure for multiple clients requiring regular data extraction from various websites. The system runs on Google Kubernetes Engine with automated scheduling and monitoring.

The Challenge

Clients needed reliable, scheduled data extraction from multiple websites with varying structures and anti-scraping measures.

The Solution

Built Dockerized Selenium scrapers exposed through a Flask REST API and deployed them on Google Kubernetes Engine, with cron job scheduling and persistent volume storage for data retention.
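As a rough illustration of the scheduled side, the sketch below shows what a cron-triggered entry point might look like: it scrapes each site once and writes one timestamped JSON file per site to a directory that, in the cluster, would be the persistent volume's mount path. The names `run_scrape_job` and `fake_fetch`, the file naming scheme, and the record layout are all assumptions made for this example, not the project's actual code; `fake_fetch` stands in for a Selenium scraper so the sketch runs without a browser.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Dict, List


def run_scrape_job(
    sites: List[str],
    fetch: Callable[[str], Dict],
    out_dir: Path,
) -> List[Path]:
    """Scrape each site once and write one JSON file per site (hypothetical layout)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # One UTC timestamp per run, so files from the same run sort together.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    written: List[Path] = []
    for index, url in enumerate(sites):
        try:
            record = {"url": url, "status": "ok", "data": fetch(url)}
        except Exception as exc:
            # A single failing site should not abort the whole scheduled run.
            record = {"url": url, "status": "error", "error": str(exc)}
        path = out_dir / f"{stamp}-{index:03d}.json"
        path.write_text(json.dumps(record, indent=2), encoding="utf-8")
        written.append(path)
    return written


def fake_fetch(url: str) -> Dict:
    """Stand-in for a Selenium-backed scraper (assumption for this sketch)."""
    if "broken" in url:
        raise RuntimeError("page layout changed")
    return {"title": f"Example page for {url}"}


if __name__ == "__main__":
    # A temporary directory replaces the persistent volume mount path here.
    with tempfile.TemporaryDirectory() as tmp:
        sites = ["https://example.com", "https://broken.example"]
        for f in run_scrape_job(sites, fake_fetch, Path(tmp)):
            print(f.name, json.loads(f.read_text(encoding="utf-8"))["status"])
```

Passing the fetch function in as a parameter keeps the scheduling and storage logic testable on its own; a real deployment would supply a Selenium-based function and the volume's mount path instead.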

Key Results

Reliable scheduled scraping across multiple sites
Containerized deployment on GKE
REST API for on-demand scraping
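For the on-demand path, a minimal Flask sketch could look like the following. The `/scrape` route, the JSON request shape, and the `scrape` placeholder are assumptions chosen for illustration; the real service's endpoints and payloads are not described in this write-up, and the placeholder returns a stub instead of driving Selenium.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def scrape(url: str) -> dict:
    """Placeholder for the Selenium-backed scraper (hypothetical)."""
    return {"url": url, "title": "stub result"}


@app.route("/scrape", methods=["POST"])
def scrape_on_demand():
    # silent=True returns None instead of raising on a missing or invalid JSON body.
    payload = request.get_json(silent=True)
    url = payload.get("url") if isinstance(payload, dict) else None
    if not isinstance(url, str) or not url.startswith(("http://", "https://")):
        return jsonify({"error": "body must contain an http(s) 'url'"}), 400
    return jsonify(scrape(url)), 200
```

Inside the container, the `app` object would be served by a WSGI server; a client then triggers a scrape by sending a POST request to `/scrape` with a JSON body such as `{"url": "https://example.com"}`.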

Tech Stack

Selenium, Docker, Flask, Google Kubernetes Engine, PostgreSQL, Cron Jobs
