Multiple Clients · 2 months · 2022

AutoScheduled Web Scrapers

Impact: Automated data collection
Infrastructure: GKE
Scheduling: Cron Jobs
API: REST

Overview

Developed a scalable web scraping infrastructure for multiple clients requiring regular data extraction from various websites. The system runs on Google Kubernetes Engine with automated scheduling and monitoring.

The Challenge

Clients needed reliable, scheduled data extraction from multiple websites with varying structures and anti-scraping measures.

The Solution

Built Dockerized Selenium scrapers exposed through a Flask REST API and deployed them on Google Kubernetes Engine, with cron jobs for scheduling and persistent volumes for data retention.
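
To illustrate how cron scheduling and persistent volume storage can fit together on Kubernetes, here is a minimal sketch that builds a `batch/v1` CronJob manifest as a plain Python dict. It is not the project's actual configuration: the function `build_cronjob`, the job name, image path, schedule, and claim name are all hypothetical placeholders, and it assumes the scrapers ran as Kubernetes CronJobs writing to a PersistentVolumeClaim.

```python
import json


def build_cronjob(name, image, schedule, claim_name, mount_path="/data"):
    """Return a Kubernetes batch/v1 CronJob manifest as a plain dict.

    All argument values are supplied by the caller; nothing here is taken
    from the original project.
    """
    volume_name = "scraped-data"  # hypothetical volume name
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,  # standard five-field cron expression
            "concurrencyPolicy": "Forbid",  # skip a run if the previous one is still going
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "OnFailure",
                            "containers": [
                                {
                                    "name": name,
                                    "image": image,
                                    # Scraped output is written under mount_path
                                    "volumeMounts": [
                                        {"name": volume_name, "mountPath": mount_path}
                                    ],
                                }
                            ],
                            "volumes": [
                                {
                                    "name": volume_name,
                                    "persistentVolumeClaim": {"claimName": claim_name},
                                }
                            ],
                        }
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    manifest = build_cronjob(
        name="demo-scraper",
        image="gcr.io/example-project/scraper:latest",
        schedule="0 */6 * * *",  # every six hours
        claim_name="scraper-data",
    )
    print(json.dumps(manifest, indent=2))
```

Because Kubernetes accepts JSON manifests as well as YAML, the printed output could be saved to a file and applied with `kubectl apply -f`. Setting `concurrencyPolicy` to `Forbid` is one reasonable choice for scrapers, since overlapping runs against the same site tend to trigger rate limits.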

Key Results

  • Reliable scheduled scraping across multiple sites

  • Containerized deployment on GKE

  • REST API for on-demand scraping
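
The on-demand REST interface could look roughly like the sketch below. It is an assumption-laden illustration, not the project's code: the `/scrape` route, the JSON payload shape, and the `create_app` and `scrape_fn` names are hypothetical, and the Selenium-backed scraper is replaced by a stub so the example runs without a browser. It requires the third-party `flask` package.

```python
from flask import Flask, jsonify, request


def create_app(scrape_fn):
    """Build a Flask app that exposes scrape_fn over HTTP.

    scrape_fn is a hypothetical callable taking a URL and returning a
    JSON-serializable dict; in a real deployment it would wrap a
    Selenium-driven scraper.
    """
    app = Flask(__name__)

    @app.route("/scrape", methods=["POST"])
    def scrape():
        # Tolerate a missing or malformed JSON body instead of raising.
        payload = request.get_json(silent=True) or {}
        url = payload.get("url")
        if not url:
            return jsonify({"error": "missing 'url'"}), 400
        return jsonify(scrape_fn(url)), 200

    return app


def stub_scraper(url):
    """Stand-in for a real scraper; returns fixed data for the given URL."""
    return {"url": url, "title": "stub"}


if __name__ == "__main__":
    # Exercise the endpoint in-process with Flask's test client.
    client = create_app(stub_scraper).test_client()
    response = client.post("/scrape", json={"url": "https://example.com"})
    print(response.status_code, response.get_json())
```

Injecting the scraper as a function keeps the HTTP layer testable on its own, which is useful when the real scraper depends on a browser running inside the container.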

Tech Stack

Selenium, Docker, Flask, Google Kubernetes Engine, PostgreSQL, Cron Jobs

Categories

Selenium, Docker, GKE, Flask