Service

Web Crawling and Data Collection Systems

Build data collection pipelines for dynamic pages, large files, and long-running jobs.

Beyond simple scraping, collection systems need retries, resume logic, status tracking, storage separation, and monitoring.
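As an illustration of storage separation and resume logic, here is a minimal sketch, not the implementation used in the projects below. It assumes large files are written to a directory keyed by their SHA-256 digest while a small metadata table records what has been collected. The function names (`open_metadata`, `store`, `pending`) and the table layout are hypothetical, and SQLite stands in for MySQL only to keep the example self-contained.

```python
import hashlib
import sqlite3
from pathlib import Path


def open_metadata(db_path):
    """Open (or create) the metadata store. Only small rows live here."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "url TEXT PRIMARY KEY, sha256 TEXT NOT NULL, "
        "size INTEGER NOT NULL, path TEXT NOT NULL)"
    )
    return conn


def store(conn, blob_dir: Path, url: str, data: bytes) -> Path:
    """Write the payload to disk, then record its metadata row."""
    digest = hashlib.sha256(data).hexdigest()
    # Shard by the first two hex characters so no directory grows too large.
    path = blob_dir / digest[:2] / digest
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
            (url, digest, len(data), str(path)),
        )
    return path


def pending(conn, urls):
    """Resume logic: return only the URLs that have no metadata row yet."""
    done = {row[0] for row in conn.execute("SELECT url FROM files")}
    return [u for u in urls if u not in done]
```

Because the metadata row is written only after the file is on disk, a job that restarts can call `pending` and skip everything already stored.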

Deliverables

Dynamic page collection with Selenium and Puppeteer
Collection status, failure reasons, and retry queue management
Separation of large file storage and metadata
Operational logs, collection reports, and admin review flows
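The status, failure-reason, and retry-queue items above can be sketched roughly as follows. This is a simplified, in-memory illustration under stated assumptions: the `Job` fields, the `Status` values, and the `run_queue` helper are hypothetical names, and `fetch` is any caller-supplied function that raises on failure. A production system would more likely persist this state in a database.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    PENDING = "pending"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Job:
    url: str
    status: Status = Status.PENDING
    attempts: int = 0
    failure_reason: Optional[str] = None


def run_queue(urls, fetch: Callable[[str], bytes], max_attempts: int = 3):
    """Process every URL, re-queueing failures until max_attempts is reached."""
    jobs = [Job(url) for url in urls]
    queue = deque(jobs)
    while queue:
        job = queue.popleft()
        job.attempts += 1
        try:
            fetch(job.url)
        except Exception as exc:
            # Keep the most recent reason so an operator can review it later.
            job.failure_reason = f"{type(exc).__name__}: {exc}"
            if job.attempts < max_attempts:
                queue.append(job)  # retry after the other queued jobs
            else:
                job.status = Status.FAILED
        else:
            job.status = Status.DONE
            job.failure_reason = None
    return jobs
```

Jobs that end in `Status.FAILED` keep their last failure reason, which is the kind of record an admin review flow or collection report could be built on.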

Expected Outcomes

Handle heterogeneous site structures
Improve stability for long-running collection jobs
Reduce missing and duplicated collection data
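One possible way to check for missing and duplicated data is to compare the expected item IDs against what was collected and to hash each payload. The sketch below assumes that two items with the same SHA-256 digest are duplicates. The `audit` function and its input shape are hypothetical and chosen only for illustration.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Identify a payload by its content rather than by its URL or filename."""
    return hashlib.sha256(data).hexdigest()


def audit(expected_ids, collected):
    """Return (missing_ids, duplicates) for one collection run.

    collected is an iterable of (item_id, data) pairs. Each entry in
    duplicates is a pair (item_id, id_of_first_item_with_same_content).
    """
    first_seen = {}  # content hash -> first item_id that produced it
    seen_ids = set()
    duplicates = []
    for item_id, data in collected:
        seen_ids.add(item_id)
        key = content_key(data)
        if key in first_seen:
            duplicates.append((item_id, first_seen[key]))
        else:
            first_seen[key] = item_id
    missing = sorted(set(expected_ids) - seen_ids)
    return missing, duplicates
```

For multi-gigabyte files the hash would normally be computed in chunks while streaming to disk rather than over an in-memory `bytes` object.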

Core Technologies

Python, Selenium, Puppeteer, MySQL, Data Pipeline

Related Projects

Evidence Collection Crawler

A high-volume data collection pipeline that gathered about 19TB of video data over two months to train an AI classifier for illegal content.

Proof

Collected about 19TB of video data over two months

Python, Selenium, MySQL, Git, Data Pipeline

Status: Delivered

Tech Collection

A live, AI-curated blog aggregator that crawls technical blogs and summarizes and classifies posts with ChatGPT. Caching improved TPS by 15x.

Proof

Built an automated crawl, summarize, classify, and serve pipeline

Spring Boot 3, Puppeteer, OpenAI API, MySQL, GCP Cloud Run, Cloud Storage (+1 more)

Status: Live

MVNO Hub Renewal

A public-service renewal project that delivered telecom order integration APIs, order-status notifications via Kakao AlimTalk, and a 32x improvement in Oracle query performance.

Proof

Delivered real-time order integration APIs for external telecom operators

Java, Spring, Oracle, RESTful API, Kakao AlimTalk, Git

Status: Delivered
