Case Study

Evidence Collection Crawler

A high-volume data collection pipeline that gathered about 19TB of video data over two months to train an AI model for illegal-content classification.

Project Overview

The evidence collection crawler was built for special value-added telecommunications operators to collect training data for AI illegal-content classification. It used Python and Selenium to crawl heterogeneous sites and JavaScript-rendered pages, added retry and resume logic to keep long-running download sessions going, and recorded collection metadata in MySQL.
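The retry and resume logic mentioned above could look roughly like the following. This is a minimal sketch, not the project's actual code: `download_with_resume`, the `fetch_from(offset)` callable, the `.part` suffix, and the backoff parameters are all hypothetical names chosen for illustration. For an HTTP source, `fetch_from` would typically issue a request with a `Range` header starting at the given byte offset.

```python
import os
import time
from typing import Callable, Iterator


def download_with_resume(
    fetch_from: Callable[[int], Iterator[bytes]],
    dest_path: str,
    max_attempts: int = 5,
    base_delay: float = 1.0,
) -> int:
    """Write a stream to dest_path, resuming from bytes already on disk.

    fetch_from(offset) is a hypothetical callable that yields chunks
    starting at the given byte offset. Returns the final size in bytes.
    """
    part_path = dest_path + ".part"  # partial file kept between attempts
    for attempt in range(1, max_attempts + 1):
        # Resume point: however many bytes the previous attempts saved.
        offset = os.path.getsize(part_path) if os.path.exists(part_path) else 0
        try:
            with open(part_path, "ab") as out:
                for chunk in fetch_from(offset):
                    out.write(chunk)
            # Only a fully downloaded file gets the final name.
            os.replace(part_path, dest_path)
            return os.path.getsize(dest_path)
        except OSError:
            # ConnectionError and TimeoutError are OSError subclasses.
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError("max_attempts must be at least 1")
```

Writing to a separate partial file means an interrupted session never leaves a truncated video under its final name, which makes it easy to tell finished downloads from unfinished ones after a restart.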

Key Challenges

Handling different site structures, schemas, and blocking policies
Keeping long-running video download sessions stable
Minimizing missing data in the AI model training set
Separating large media storage from collection metadata
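One common way to cope with differing site structures is a small adapter interface with one implementation per site. The sketch below is an assumption about how such a design might look, not the project's actual code: `SiteAdapter`, `VideoItem`, `register`, `adapter_for`, and the `video.example.com` domain and markup are all hypothetical. In the real crawler the page HTML would presumably come from a Selenium-rendered page.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class VideoItem:
    source_url: str
    title: str


class SiteAdapter(ABC):
    """Per-site parsing rules behind one shared interface."""

    domain: str  # hostname this adapter is responsible for

    @abstractmethod
    def extract_items(self, page_html: str) -> list:
        """Return the VideoItem entries found on a rendered page."""


ADAPTERS = {}  # hostname -> adapter instance


def register(adapter_cls):
    """Class decorator that adds an adapter to the registry."""
    ADAPTERS[adapter_cls.domain] = adapter_cls()
    return adapter_cls


@register
class ExampleSiteAdapter(SiteAdapter):
    # Hypothetical site and markup, for illustration only.
    domain = "video.example.com"
    _link = re.compile(r'<a class="video" href="([^"]+)">([^<]+)</a>')

    def extract_items(self, page_html: str) -> list:
        return [VideoItem(url, title) for url, title in self._link.findall(page_html)]


def adapter_for(url: str) -> SiteAdapter:
    """Pick the adapter by hostname, failing loudly for unknown sites."""
    host = urlparse(url).hostname
    if host not in ADAPTERS:
        raise LookupError(f"no adapter registered for {host!r}")
    return ADAPTERS[host]
```

With this shape, supporting a new site means adding one class, and the download and metadata code stays unaware of site-specific markup.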

Key Outcomes

Collected about 19TB of video data over two months
Operated a long-running crawler without interruption
Built site-specific adapters for heterogeneous collection targets
Tracked collection status with MySQL metadata
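Status tracking of this kind can be sketched with a single table that stores only a pointer to each media file, so the large video files live elsewhere. The example below is an assumption, not the project's schema: the `collection_items` table, its columns, and the helper functions are hypothetical, and it uses Python's built-in `sqlite3` in place of MySQL so that it runs without a database server. With MySQL the same idea would go through a driver such as PyMySQL, with `%s` placeholders and `INSERT IGNORE` instead of SQLite's `INSERT OR IGNORE`.

```python
import sqlite3

# Hypothetical schema: one row per target URL; the video itself is stored
# on disk and referenced only through storage_path.
SCHEMA = """
CREATE TABLE IF NOT EXISTS collection_items (
    source_url   TEXT PRIMARY KEY,
    site         TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'pending',
    storage_path TEXT,
    attempts     INTEGER NOT NULL DEFAULT 0
)
"""


def open_metadata(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn, site, source_url):
    """Record a target once; re-discovering the same URL is a no-op."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO collection_items (source_url, site) VALUES (?, ?)",
            (source_url, site),
        )


def mark_done(conn, source_url, storage_path):
    with conn:
        conn.execute(
            "UPDATE collection_items SET status = 'done', storage_path = ?, "
            "attempts = attempts + 1 WHERE source_url = ?",
            (storage_path, source_url),
        )


def mark_failed(conn, source_url):
    with conn:
        conn.execute(
            "UPDATE collection_items SET status = 'failed', "
            "attempts = attempts + 1 WHERE source_url = ?",
            (source_url,),
        )


def missing_items(conn):
    """URLs that still need collecting (pending or failed)."""
    rows = conn.execute(
        "SELECT source_url FROM collection_items "
        "WHERE status != 'done' ORDER BY source_url"
    ).fetchall()
    return [row[0] for row in rows]
```

Querying for everything that is not yet `done` gives a direct list of what to retry, which is one way to keep gaps in the collected data visible.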

Technologies

Python
Selenium
MySQL
Git
Data Pipeline