Baidu Encyclopedia crawling tutorial

Baidu Encyclopedia currently has 16,330,473 entries

Below is a Scrapy-based distributed Baidu Baike crawler that can crawl all Baidu Baike entries.

GitHub address

Features

  • Crawls entries from encyclopedia sites, including Baidu Baike, Hudong Baike (Interactive Encyclopedia), and the Chinese and English Wikipedia sites;
  • Supports resuming an interrupted crawl from where it left off (a minimal sketch of how this can work follows this list);
  • Supports caching of encyclopedia entry pages;
  • Distributed deployment;
  • In a single-machine test (i9-9900K, 64 GB RAM, 100 Mbps bandwidth) with the default configuration, roughly 500,000 Baidu Baike entries can be crawled per day; Hudong Baike test results are similar; only a small amount of Wikipedia data was captured, and that crawl is more heavily affected by the configured proxy delay;
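
The project's own code is not reproduced in this post, so the following is only a hedged sketch of how breakpoint resume and distributed scheduling can be wired up in a Scrapy project using scrapy-redis: the request queue and dedup fingerprints live in Redis, so an interrupted crawl picks up where it stopped and several workers can share the same queue. The spider name, Redis key, and CSS selectors below are assumptions, not the repo's actual code.

    # Hedged sketch, not the repo's actual spider: scrapy-redis keeps pending
    # requests and dedup fingerprints in Redis, enabling resume and distribution.
    from scrapy_redis.spiders import RedisSpider


    class BaikeSpider(RedisSpider):
        name = "baike"                       # assumed spider name
        redis_key = "baike:start_urls"       # seeds are pushed into this Redis list

        custom_settings = {
            # scrapy-redis scheduler and dupefilter store state in Redis ...
            "SCHEDULER": "scrapy_redis.scheduler.Scheduler",
            "DUPEFILTER_CLASS": "scrapy_redis.dupefilter.RFPDupeFilter",
            # ... and keep it there on shutdown, which is what allows resuming.
            "SCHEDULER_PERSIST": True,
        }

        def parse(self, response):
            # Selectors are illustrative; the real page markup may differ.
            yield {
                "url": response.url,
                "title": response.css("h1::text").get(),
            }
            # Follow in-site entry links; Redis deduplicates repeated URLs.
            for href in response.css("a[href*='/item/']::attr(href)").getall():
                yield response.follow(href, callback=self.parse)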

How to use

  • Install dependencies: pip install -r requirement.txt
  • Initialize the database: python initialize_db.py
  • Initialize the crawler seeds: python initialize_tasks_seeds.py (see the seed-loading sketch after this list)
  • Start the crawler: python start_spiders.py
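
The post does not show what initialize_tasks_seeds.py does internally; as a rough, hedged sketch of the idea, seed entry URLs can be read from a file and pushed into the Redis list that the spiders consume as start URLs. The file name and key name below are assumptions.

    # Hedged sketch of seed initialization (not the repo's actual script).
    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)

    with open("seeds.txt", encoding="utf-8") as f:    # assumed format: one entry URL per line
        for line in f:
            url = line.strip()
            if url:
                r.lpush("baike:start_urls", url)      # the list the spiders read seeds from

    print("queued seeds:", r.llen("baike:start_urls"))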

Distributed use

  • Single machine: run python start_spiders.py multiple times
  • Multiple machines: configure shared Redis and MySQL servers (see the settings sketch after this list), then run python start_spiders.py multiple times on each machine
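
As a hedged illustration of the shared-backend configuration a multi-machine run needs: every worker points at the same Redis instance (scheduling and dedup) and the same MySQL server (storage). The host addresses, credentials, and variable names below are placeholders; the repo's own config file may organize them differently.

    # Example settings for a distributed run; all values are placeholders.
    REDIS_URL = "redis://:password@192.168.1.10:6379/0"   # shared scheduler + dupefilter (scrapy-redis setting)

    MYSQL = {
        "host": "192.168.1.11",     # shared storage for crawled entries
        "port": 3306,
        "user": "crawler",
        "password": "password",
        "database": "baike",
        "charset": "utf8mb4",       # entries contain Chinese text
    }

With both backends shared, each additional python start_spiders.py process simply becomes another consumer of the same Redis request queue.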

FAQ

  • In theory, the more comprehensive the seed list you provide, the more entries you can crawl
  • Seed links (extraction code: iagw), from a 2012 Baidu Baike dump

Known bugs

  • With multi-process crawling, Redis memory overflows (64 GB). The machine has since been switched to 1.5 TB of memory and the bug has not recurred; after three hours of running, used memory reached as high as 38 GB.

Redis monitoring
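
To keep an eye on the memory growth described in the known bug above, a minimal watcher along these lines could be used; it is not part of the repo, and the Redis key name is an assumption.

    # Hedged monitoring sketch: print Redis memory usage and queue length every minute.
    import time
    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)

    while True:
        info = r.info("memory")                      # same data as `redis-cli info memory`
        used_gb = info["used_memory"] / 1024 ** 3
        pending = r.llen("baike:start_urls")         # assumed key for queued seeds
        print(f"used_memory={used_gb:.1f} GB  pending_seeds={pending}")
        time.sleep(60)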

Stars are welcome!

Origin blog.csdn.net/u013741019/article/details/102882731