Today I am open-sourcing a Baidu Cloud (Baidu Wangpan) share crawler project: https://github.com/callmelanmao/yunshare .
Baidu cloud sharing crawler project
There are several similar open-source projects on GitHub, but they only provide the crawler itself. On top of the crawler, this project adds modules for saving the data and building an Elasticsearch index, so it can be used in a real production environment; the web front end, however, you still have to develop yourself.
Install
Install Node.js and pm2: Node runs the crawler and indexing programs, and pm2 manages the Node tasks.

Then install MySQL and MongoDB: MySQL stores the crawler data, and MongoDB stores the final Baidu Cloud share data. Since that data is in JSON format, MongoDB is the more convenient store for it.

$ git clone https://github.com/callmelanmao/yunshare
$ cnpm i

It is recommended to install the npm dependencies with cnpm. The easiest way to install cnpm:

$ npm install -g cnpm --registry=https://registry.npm.taobao.org

More cnpm commands can be found at npm.taobao.org .
Initialization
The crawler data (mainly the url list) is stored in MySQL. Yunshare uses sequelizejs for the ORM mapping; the source file is src/models/index.js. The default MySQL username and password are both root, and the database is named yun. You need to create the yun database manually:

create database yun default charset utf8

Adjust the username and password to match your own setup. After completing the MySQL configuration, you can run the following commands:

$ gulp babel
$ node dist/script/init.js
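For reference, the connection settings in src/models/index.js presumably boil down to something like the sketch below. Only the database name yun and the root/root defaults come from this article; the host, dialect, and the exact shape of the file are assumptions, so check the actual source before editing.

```javascript
// Hypothetical sketch of the MySQL connection settings behind the
// sequelize models in src/models/index.js (the real file may differ).
// Edit these values to match your own MySQL setup before running init.
const dbConfig = {
  database: 'yun',   // created with: create database yun default charset utf8
  username: 'root',  // default from the article
  password: 'root',  // default from the article -- change to your own
  host: 'localhost', // assumed default
  dialect: 'mysql'
};

module.exports = dbConfig;
```

The point is simply that the credentials live in one place in the models file; once they match your server, the init script can create its tables there.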
Note that you must run gulp babel first to compile the ES6 code to ES5, and then run the initialization script to import the seed data. The data file is data/hot.json, saved from the page http://yun.baidu.com/pcloud/fr ... b%3D1 .
Starting the project
Yunshare uses pm2 for Node.js process management. Run

$ pm2 start process.json

to start all the background tasks. To check whether the tasks are running normally, use

$ pm2 list

There should be 4 tasks running normally.
Starting the Elasticsearch indexer
The Elasticsearch indexing program is also already written; the mapping file is data/mapping.json. Make sure you have installed Elasticsearch 5.0 before running the indexer:

$ pm2 start dist/elastic.js
The default Elasticsearch address is http://localhost:9200 . If you need to change it, edit src/ElasticWorker.js. After modifying any js source file, remember to rerun gulp babel and restart the pm2 tasks, otherwise the changes will not take effect.
After completing the elasticsearch configuration, you can also add an elastic task to process.json, so that you do not need to start the indexer separately.
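For example, the extra task added to process.json might look like the fragment below. This follows pm2's standard "apps" format; the repo's existing process.json entries are not shown here, and the task name is an assumption.

```json
{
  "apps": [
    {
      "name": "elastic",
      "script": "dist/elastic.js"
    }
  ]
}
```

With an entry like this merged into the existing apps array, pm2 start process.json brings up the indexer together with the other background tasks.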
DEMO
Bili Search

The next article will introduce the overall design of the project and the problems encountered during development.