AutoScheduled Web Scrapers
Impact: Automated data collection
Infrastructure: GKE
Scheduling: Cron Jobs
API: REST

Overview
Developed a scalable web scraping infrastructure for multiple clients requiring regular data extraction from various websites. The system runs on Google Kubernetes Engine with automated scheduling and monitoring.
The Challenge
Clients needed reliable, scheduled data extraction from multiple websites with varying structures and anti-scraping measures.
The Solution
Built Dockerized Selenium scrapers exposed through a Flask REST API and deployed them on Google Kubernetes Engine, with cron job scheduling for recurring runs and persistent volume storage for data retention.
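To make the scheduling and storage pieces concrete, here is a minimal sketch of how a Kubernetes CronJob manifest for one scraper could be generated. It is an illustration under stated assumptions, not the project's actual configuration: the function `build_cronjob`, the volume name `scrape-data`, and the default mount path `/data` are all hypothetical. Only the manifest fields themselves (`batch/v1`, `schedule`, `concurrencyPolicy`, `persistentVolumeClaim`) are standard Kubernetes.

```python
import json


def build_cronjob(name, image, schedule, pvc_name, mount_path="/data"):
    """Return a Kubernetes batch/v1 CronJob manifest as a plain dict.

    All argument values are supplied by the caller; nothing here reflects
    the real image names or schedules used in the project.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            # Standard five-field cron expression, e.g. "0 3 * * *".
            "schedule": schedule,
            # Skip a run if the previous one is still in progress.
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "OnFailure",
                            "containers": [
                                {
                                    "name": name,
                                    "image": image,
                                    # Scraped output is written under mount_path.
                                    "volumeMounts": [
                                        {"name": "scrape-data", "mountPath": mount_path}
                                    ],
                                }
                            ],
                            # Back the mount with an existing PersistentVolumeClaim
                            # so data survives after the job's pod is gone.
                            "volumes": [
                                {
                                    "name": "scrape-data",
                                    "persistentVolumeClaim": {"claimName": pvc_name},
                                }
                            ],
                        }
                    }
                }
            },
        },
    }


def to_json(manifest):
    """Serialize the manifest; kubectl accepts JSON as well as YAML."""
    return json.dumps(manifest, indent=2)
```

Writing the output of `to_json(build_cronjob(...))` to a file and passing it to `kubectl apply -f` would register the schedule. Generating one manifest per target site keeps each scraper's schedule independent.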
Key Results
Reliable scheduled scraping across multiple sites
Containerized deployment on GKE
REST API for on-demand scraping
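The on-demand REST endpoint could look roughly like the sketch below. It assumes a single `POST /scrape` route that takes a JSON body with a `url` field; the route name, the payload shape, and the functions `selenium_scrape` and `create_app` are hypothetical and chosen only for illustration. The scraper is passed in as a callable so the API can be exercised without launching a browser.

```python
from flask import Flask, jsonify, request


def selenium_scrape(url):
    """Hypothetical scraper: open the page in headless Chrome, return its title."""
    # Imported lazily so the API module loads even where Selenium is absent.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")  # commonly needed inside containers
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return {"url": url, "title": driver.title}
    finally:
        # Always release the browser, even if the page load fails.
        driver.quit()


def create_app(scrape_fn=selenium_scrape):
    """Build the Flask app around any callable that maps a URL to a dict."""
    app = Flask(__name__)

    @app.route("/scrape", methods=["POST"])
    def scrape():
        payload = request.get_json(silent=True) or {}
        url = payload.get("url")
        if not url:
            return jsonify(error="missing 'url'"), 400
        try:
            result = scrape_fn(url)
        except Exception as exc:
            # Report scraper failures as an upstream error, not a crash.
            return jsonify(error=str(exc)), 502
        return jsonify(result), 200

    return app
```

Calling `create_app().run(host="0.0.0.0", port=8080)` would serve it locally with the Selenium-backed scraper. In tests, a stub such as `lambda url: {"url": url, "title": "stub"}` can stand in for `selenium_scrape`.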
Tech Stack
Selenium, Docker, Flask, Google Kubernetes Engine, PostgreSQL, Cron Jobs