
Case Study

Evidence Collection Crawler

A high-volume data collection pipeline that gathered about 19TB of video data over two months for AI illegal-content classification training.

Project Overview

The evidence collection crawler was built for special value-added telecommunications operators to collect training data for AI illegal-content classification. It used Python and Selenium to crawl heterogeneous sites and JavaScript-rendered pages, kept long-running download sessions alive with retry and resume logic, and tracked collection metadata in MySQL.
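
The retry and resume behavior described above might look roughly like the following. This is a minimal sketch, not the project's actual code: `download_with_resume` and the `open_stream(offset)` callable are hypothetical names, and the sketch assumes the source can restart a stream at a byte offset (over HTTP this would typically be a `Range: bytes=<offset>-` request).

```python
import os
import time
from typing import Callable, Iterable


def download_with_resume(
    open_stream: Callable[[int], Iterable[bytes]],
    dest_path: str,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Write a stream to dest_path, resuming from the partial file after a failure.

    open_stream(offset) is a hypothetical callable (an assumption of this sketch)
    that yields byte chunks starting at `offset`. Returns the final file size.
    """
    attempt = 0
    while True:
        # Resume point: whatever already reached the disk on earlier attempts.
        offset = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
        try:
            with open(dest_path, "ab") as f:
                for chunk in open_stream(offset):
                    f.write(chunk)
            return os.path.getsize(dest_path)
        except (ConnectionError, TimeoutError):
            attempt += 1
            if attempt > max_retries:
                raise
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Because the resume offset is read from the file on disk rather than held in memory, the same logic also covers a crawler process that is restarted in the middle of a download.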

Key Challenges

  • Handling different site structures, schemas, and blocking policies
  • Keeping long-running video download sessions stable
  • Minimizing missing data in the AI model training set
  • Separating large media storage from collection metadata
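
One common way to handle differing site structures is a per-site adapter behind a shared interface. The sketch below illustrates that idea only: the domains, class names, and extraction patterns are all hypothetical, and the HTML is passed in as a string (in a Selenium crawler it would typically come from `driver.page_source`).

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List
from urllib.parse import urlparse


@dataclass(frozen=True)
class VideoItem:
    source_site: str
    page_url: str
    media_url: str


class SiteAdapter(ABC):
    """Shared interface; each site gets its own extraction rules."""

    domain: str

    @abstractmethod
    def extract(self, page_url: str, html: str) -> List[VideoItem]:
        ...


class ExampleTubeAdapter(SiteAdapter):
    domain = "tube.example"  # hypothetical site

    def extract(self, page_url: str, html: str) -> List[VideoItem]:
        # Assumed layout: media URLs live in data-src attributes.
        urls = re.findall(r'data-src="([^"]+)"', html)
        return [VideoItem(self.domain, page_url, u) for u in urls]


class ExampleBoardAdapter(SiteAdapter):
    domain = "board.example"  # hypothetical site

    def extract(self, page_url: str, html: str) -> List[VideoItem]:
        # Assumed layout: media URLs live in <source src="..."> tags.
        urls = re.findall(r'<source src="([^"]+)"', html)
        return [VideoItem(self.domain, page_url, u) for u in urls]


ADAPTERS: Dict[str, SiteAdapter] = {
    a.domain: a for a in (ExampleTubeAdapter(), ExampleBoardAdapter())
}


def adapter_for(url: str) -> SiteAdapter:
    """Pick the adapter registered for the URL's host."""
    host = urlparse(url).hostname or ""
    try:
        return ADAPTERS[host]
    except KeyError:
        raise LookupError(f"no adapter registered for {host!r}")
```

With this shape, supporting a new site means adding one adapter class and registering it, while the download and bookkeeping code stays unchanged.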

Key Outcomes

  • Collected about 19TB of video data over two months
  • Operated a long-running crawler without interruption
  • Built site-specific adapters for heterogeneous collection targets
  • Tracked collection status with MySQL metadata
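
Status tracking of this kind can be sketched as a small table that records each media URL, its state, and where the file was stored, so the large media files themselves stay outside the database. The example below is an illustration under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server, and the table name, columns, and helper functions are hypothetical.

```python
import sqlite3
from typing import List

# Hypothetical schema: one row per media URL; the video itself is stored
# elsewhere and only its path is recorded here.
SCHEMA = """
CREATE TABLE IF NOT EXISTS collection_item (
    media_url    TEXT PRIMARY KEY,
    site         TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'pending',
    attempts     INTEGER NOT NULL DEFAULT 0,
    storage_path TEXT
)
"""


def open_metadata(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def register(conn: sqlite3.Connection, site: str, media_url: str) -> None:
    # INSERT OR IGNORE is SQLite syntax; MySQL spells it INSERT IGNORE.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO collection_item (media_url, site) VALUES (?, ?)",
            (media_url, site),
        )


def mark_done(conn: sqlite3.Connection, media_url: str, storage_path: str) -> None:
    with conn:
        conn.execute(
            "UPDATE collection_item SET status = 'done', storage_path = ? "
            "WHERE media_url = ?",
            (storage_path, media_url),
        )


def mark_failed(conn: sqlite3.Connection, media_url: str) -> None:
    with conn:
        conn.execute(
            "UPDATE collection_item SET status = 'failed', attempts = attempts + 1 "
            "WHERE media_url = ?",
            (media_url,),
        )


def todo(conn: sqlite3.Connection, max_attempts: int = 3) -> List[str]:
    """URLs still worth fetching: never tried, or failed fewer than max_attempts times."""
    rows = conn.execute(
        "SELECT media_url FROM collection_item "
        "WHERE status = 'pending' OR (status = 'failed' AND attempts < ?) "
        "ORDER BY media_url",
        (max_attempts,),
    )
    return [r[0] for r in rows]
```

Querying `todo` after a restart gives the crawler its remaining work list, which is one way to keep missing items visible instead of silently dropped.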

Technologies

  • Python
  • Selenium
  • MySQL
  • Git
  • Data Pipeline