Service

Web Crawling and Data Collection Systems

Build data collection pipelines for dynamic pages, large files, and long-running jobs.

Beyond simple scraping, collection systems need retries, resume logic, status tracking, storage separation, and monitoring.

Deliverables

  • Dynamic page collection with Selenium and Puppeteer
  • Collection status, failure reasons, and retry queue management
  • Separation of large file storage and metadata
  • Operational logs, collection reports, and admin review flows
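
As an illustration of the status, failure-reason, and retry-queue deliverable, here is a minimal in-memory sketch. The names `Task`, `Status`, `run_queue`, and the `fetch` callable are hypothetical and chosen for this example. A production system would presumably persist this state (for example in MySQL) so that jobs can resume after a restart.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Iterable, Optional


class Status(Enum):
    PENDING = "pending"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Task:
    # Hypothetical record: one row per URL to collect.
    url: str
    status: Status = Status.PENDING
    attempts: int = 0
    failure_reason: Optional[str] = None


def run_queue(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    max_attempts: int = 3,
) -> Dict[str, Task]:
    """Process every URL, re-queueing failures until max_attempts is reached."""
    tasks = {url: Task(url) for url in urls}
    queue = deque(tasks.values())
    while queue:
        task = queue.popleft()
        task.attempts += 1
        try:
            fetch(task.url)
            task.status = Status.DONE
            task.failure_reason = None
        except Exception as exc:
            # Keep the last failure reason so it can appear in a report.
            task.failure_reason = f"{type(exc).__name__}: {exc}"
            if task.attempts < max_attempts:
                queue.append(task)  # retry later, behind the other tasks
            else:
                task.status = Status.FAILED
    return tasks
```

The returned mapping can feed a collection report directly: tasks left in `FAILED` carry their last failure reason, which is the kind of record an admin review flow would inspect.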

Expected Outcomes

  • Handle heterogeneous site structures
  • Improve stability for long-running collection jobs
  • Reduce missing and duplicated collection data
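
One way to reduce duplicated and missing data, sketched under assumptions: files are stored once per content hash, while a separate metadata table records which source URL produced which file. SQLite stands in for MySQL here only so the example is self-contained. The table layout and the function names `open_metadata`, `store`, and `missing` are hypothetical.

```python
import hashlib
import sqlite3
from pathlib import Path
from typing import Iterable, List


def open_metadata(db_path: str) -> sqlite3.Connection:
    """Open (or create) the metadata store, kept apart from the file storage."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "source_url TEXT PRIMARY KEY, sha256 TEXT NOT NULL, size INTEGER NOT NULL)"
    )
    return conn


def store(conn: sqlite3.Connection, blob_dir: Path, source_url: str, payload: bytes) -> bool:
    """Save payload once per content hash; return True if a new file was written."""
    digest = hashlib.sha256(payload).hexdigest()
    blob_dir.mkdir(parents=True, exist_ok=True)
    blob_path = blob_dir / digest
    is_new = not blob_path.exists()
    if is_new:
        blob_path.write_bytes(payload)
    with conn:  # commits the metadata row, or rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO items (source_url, sha256, size) VALUES (?, ?, ?)",
            (source_url, digest, len(payload)),
        )
    return is_new


def missing(conn: sqlite3.Connection, expected_urls: Iterable[str]) -> List[str]:
    """Return expected URLs that have no metadata row yet."""
    seen = {row[0] for row in conn.execute("SELECT source_url FROM items")}
    return [url for url in expected_urls if url not in seen]
```

Calling `missing` with the full list of expected URLs after a run gives the set to re-queue, and identical payloads from different URLs occupy storage only once.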

Core Technologies

Python, Selenium, Puppeteer, MySQL, Data Pipeline

Related Projects

Delivered

Evidence Collection Crawler

A high-volume data collection pipeline that gathered about 19TB of video data over two months to train an AI classifier for illegal content.

Proof

Collected about 19TB of video data over two months

Python, Selenium, MySQL, Git, Data Pipeline
Status: Delivered

Live

Tech Collection

A live, AI-curated blog aggregator that crawls technical blogs, summarizes and classifies posts with ChatGPT, and uses caching to improve throughput (TPS) by 15x.

Proof

Built an automated crawl, summarize, classify, and serve pipeline

Spring Boot 3, Puppeteer, OpenAI API, MySQL, GCP Cloud Run, Cloud Storage, +1

MVNO Hub Renewal

A public-service renewal project that delivered telecom order integration APIs, order-status Kakao AlimTalk notifications, and a 32x Oracle query improvement.

Proof

Delivered real-time order integration APIs for external telecom operators

Java, Spring, Oracle, RESTful API, Kakao AlimTalk, Git
Status: Delivered