Baidu Baike (Baidu Encyclopedia) currently has 16,330,473 entries.
Here is a Scrapy-based distributed crawler that can fetch all Baidu Baike entries.
Features
- Crawls entries from multiple encyclopedia sites: Baidu Baike, Hudong Baike (Interactive Encyclopedia), and the Chinese- and English-language wiki sites;
- Supports resuming an interrupted crawl where it left off;
- Supports caching fetched encyclopedia entry pages;
- Distributed deployment;
- In a single-machine test (i9-9900K, 64 GB RAM, 100 Mbps bandwidth, default configuration), the crawler fetched about 500,000 Baidu Baike entries per day; Hudong Baike results were similar. Only a small amount of wiki-site data was fetched, and its throughput is strongly affected by the configured proxy delay;
How to use
- Install dependencies
pip install -r requirement.txt
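The contents of requirement.txt are not reproduced here. Given the stack described above (Scrapy, Redis, MySQL), it plausibly lists packages along these lines; the exact package set and versions are assumptions, so consult the file in the repository:

```
scrapy
scrapy-redis
redis
pymysql
```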
- Initialize the database
python initialize_db.py
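initialize_db.py itself is not shown here. As a rough sketch of what database initialization for a crawl task queue can look like, the snippet below creates a hypothetical tasks table. The table name, columns, and the use of sqlite3 (so the sketch is self-contained) are all assumptions; the project itself targets MySQL.

```python
import sqlite3

# Hypothetical task table; the real project uses MySQL and its own schema.
CREATE_TASKS_SQL = """
CREATE TABLE IF NOT EXISTS tasks (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,   -- entry URL to crawl
    status INTEGER DEFAULT 0,   -- 0 = pending, 1 = done, 2 = failed
    fetched_at TEXT             -- timestamp of the last successful fetch
)
"""

def initialize_db(conn):
    """Create the task table if it does not exist yet."""
    conn.execute(CREATE_TASKS_SQL)
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    initialize_db(conn)
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'"
    ).fetchall()
    print(tables)  # [('tasks',)]
```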
- Initialize the crawler seed
python initialize_tasks_seeds.py
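Seeding presumably turns a list of entry titles into start URLs and pushes them into the shared queue for the spiders to consume. A minimal sketch of the URL-building part, using Baidu Baike's public `/item/<title>` URL scheme (the helper names are invented for illustration):

```python
from urllib.parse import quote

def build_entry_url(title: str) -> str:
    # Baidu Baike serves entries at /item/<percent-encoded title>.
    return "https://baike.baidu.com/item/" + quote(title, safe="")

def build_seed_urls(titles):
    # Deduplicate while preserving the original seed order.
    seen, urls = set(), []
    for title in titles:
        url = build_entry_url(title)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

# These URLs would then be pushed onto the spiders' Redis start-URL queue.
print(build_seed_urls(["Python", "Python", "百度百科"]))
```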
- Start the crawler
python start_spiders.py
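start_spiders.py likely launches one Scrapy process per spider so they can crawl in parallel. A hedged sketch of such a launcher; the spider names below are placeholders, not the project's actual spider names:

```python
import subprocess

def build_commands(spider_names):
    # One `scrapy crawl <name>` command per spider; each runs in its own process.
    return [["scrapy", "crawl", name] for name in spider_names]

def start_spiders(spider_names, dry_run=False):
    """Launch every spider as a separate OS process and wait for all of them.

    With dry_run=True, only return the commands instead of spawning anything.
    """
    commands = build_commands(spider_names)
    if dry_run:
        return commands
    procs = [subprocess.Popen(cmd) for cmd in commands]
    for proc in procs:
        proc.wait()

if __name__ == "__main__":
    # Placeholder spider names; substitute the project's real spiders.
    print(start_spiders(["baidu", "hudong", "wiki"], dry_run=True))
```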
Distributed use
- Single machine: run the start command multiple times to launch several crawler processes
python start_spiders.py
- Multiple machines: configure shared Redis and MySQL servers, then run the start command on each machine (multiple times per machine if desired)
python start_spiders.py
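For multi-machine runs, every worker must point at the same Redis (shared scheduler queue and dedup filter) and the same MySQL server (shared storage). With scrapy-redis, the relevant settings.py entries look roughly like this; the host names, credentials, and the MYSQL_* setting names are placeholders, not the project's actual configuration:

```python
# settings.py fragment (all values are placeholders)

# scrapy-redis: shared scheduler and duplicate filter backed by Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # keep the queue across restarts (resumable crawls)
REDIS_URL = "redis://shared-redis-host:6379/0"

# MySQL connection for storing crawled entries (custom settings, read by a pipeline)
MYSQL_HOST = "shared-mysql-host"
MYSQL_PORT = 3306
MYSQL_DB = "baike"
MYSQL_USER = "crawler"
MYSQL_PASSWORD = "change-me"
```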
FAQ
- In theory, the more comprehensive your seed list is, the more entries you can crawl
- Seed links (extraction code: iagw), extracted from a 2012 Baidu Baike dump
Known bugs
- With multi-process crawling, Redis can run out of memory (observed on a 64 GB machine). After moving to a machine with 1.5 TB of RAM, the bug has not recurred; after three hours of running, Redis memory usage reached 38 GB.
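Much of that Redis memory is the duplicate-filter fingerprint set, which stores one fingerprint per URL ever seen. A common mitigation (not implemented in this project, mentioned here only as background) is to replace the exact set with a Bloom filter, trading a small false-positive rate for bounded memory. A minimal in-memory sketch of the idea:

```python
import hashlib

class BloomFilter:
    """Fixed-size Bloom filter: lookups may false-positive, never false-negative."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)  # memory use is fixed up front

    def _positions(self, item: str):
        # Derive num_hashes bit positions from salted MD5 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("https://baike.baidu.com/item/Python")
print("https://baike.baidu.com/item/Python" in bf)  # True
print("https://baike.baidu.com/item/Java" in bf)    # False with very high probability
```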
Stars are welcome!