yunshare: a practical open source Baidu cloud sharing crawler project - installation

Today, a Baidu cloud network disk sharing crawler project has been open sourced at https://github.com/callmelanmao/yunshare .

Baidu cloud sharing crawler project

There are several similar open source projects on GitHub, but they only provide the crawler part. On top of the crawler, this project also adds modules for saving the data and building an elasticsearch index, so it can be used in a real production environment; the web front end, however, still has to be developed by yourself.

Install

Install node.js and pm2. Node is used to run the crawler and indexing programs, and pm2 is used to manage the node tasks.

Also install mysql and mongodb. Mysql is used to save the crawler data, and mongodb is used to save the final Baidu cloud sharing data; since that data is in JSON format, it is more convenient to store it in mongodb.
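
The exact installation steps depend on your system: node.js, mysql and mongodb can be installed from their official sites or your OS package manager. Assuming node and npm are already present, pm2 can be installed globally with npm:

npm install -g pm2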

git clone https://github.com/callmelanmao/yunshare
cnpm i

It is recommended to use the cnpm command to install the npm dependencies. The easiest way to install cnpm is:

$ npm install -g cnpm --registry=https://registry.npm.taobao.org

More cnpm commands can be found at npm.taobao.org.

Initialization

The crawler data (mainly the url list) is stored in the mysql database. Yunshare uses sequelizejs for the ORM mapping; the source file is src/models/index.js. The default mysql user name and password are both root, and the database name is yun. You need to create the yun database manually:

create database yun default charset utf8
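
For orientation, the connection setup in src/models/index.js is roughly along these lines (a minimal sketch, assuming the defaults described above; the real file also defines the models and may differ in detail):

import Sequelize from 'sequelize';

// database yun, user root, password root - adjust to match your own mysql setup
const sequelize = new Sequelize('yun', 'root', 'root', {
  host: 'localhost',
  dialect: 'mysql'
});

export default sequelize;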

Change the user name and password to match your own setup. After completing the mysql configuration, you can run the following commands.

gulp babel
node dist/script/init.js

Note that you must first run gulp babel to compile the es6 code into es5, and then run the initialization script to import the initial data. The data file is data/hot.json, which was saved from the page http://yun.baidu.com/pcloud/fr ... b%3D1 .
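
For reference, the babel task in the gulpfile is typically something like the sketch below (assuming gulp-babel; the project's actual gulpfile may differ):

// gulpfile.js - compile the es6 sources in src/ into es5 under dist/
const gulp = require('gulp');
const babel = require('gulp-babel');

gulp.task('babel', () =>
  gulp.src('src/**/*.js')
    .pipe(babel())            // presets are normally picked up from .babelrc
    .pipe(gulp.dest('dist'))
);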

Start the project

Yunshare uses pm2 for nodejs process management. Run pm2 start process.json to start all the background tasks, and check whether the tasks are running normally with pm2 list; there should be 4 tasks running.
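
A pm2 process.json has the general shape sketched below; the task names and script paths here are placeholders rather than the exact entries shipped with yunshare, which defines 4 tasks:

{
  "apps": [
    { "name": "crawler", "script": "dist/crawler.js" },
    { "name": "store",   "script": "dist/store.js" }
  ]
}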

Start elasticsearch index

The elasticsearch indexing program has also been written; the mapping file is data/mapping.json. Please make sure that you have installed elasticsearch 5.0 before running the indexing program with pm2 start dist/elastic.js.
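
For orientation, an elasticsearch 5.x mapping file of this kind has roughly the following shape (the field and type names here are placeholders, not the actual contents of data/mapping.json):

{
  "mappings": {
    "share": {
      "properties": {
        "title":   { "type": "text" },
        "shareid": { "type": "keyword" }
      }
    }
  }
}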

The default elasticsearch address is http://localhost:9200 . If you need to change this address, modify it in src/ElasticWorker.js. After modifying any js source code, remember to run gulp babel again and restart the pm2 task, otherwise the modification will not take effect.
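
As an illustration of where that address comes in, the indexing code in src/ElasticWorker.js will contain something along these lines (a sketch only, assuming the official elasticsearch npm client; the index and field names are hypothetical):

import elasticsearch from 'elasticsearch';

// default elasticsearch address - change the host here if your cluster runs elsewhere
const client = new elasticsearch.Client({ host: 'http://localhost:9200' });

// index one Baidu cloud share record (index/type/id fields are placeholders)
function indexShare(share) {
  return client.index({
    index: 'yunshare',
    type: 'share',
    id: String(share.shareid),
    body: share
  });
}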

After completing the elasticsearch configuration, you can also add an elastic task to process.json, so that you do not need to start the indexer separately.
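
For example, the extra entry appended to the apps array of process.json could simply be (the name is a placeholder; the script path is the one used above):

{ "name": "elastic", "script": "dist/elastic.js" }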

DEMO

Demo: Bili Search.

The next article will introduce the overall design ideas of the entire project and the problems encountered during development.
