Alibaba Cloud Disk Search Engine: Project Architecture

This is a personal project, so its design priorities differ from those of an enterprise product: it leans toward saving cost while keeping the system as stable as possible.

[Overall project architecture diagram]

Completing this project meant weighing more alternatives than just the two below, but these decisions stand out:

  • Search engine selection: I tried the somewhat toy-like meiliSearch and plain MongoDB queries before finally choosing Elasticsearch.
  • Building the crawler proxy pool: the proxies must be highly anonymous, otherwise the crawler's IPs get rate-limited; and the more proxy IPs in the pool, the higher the crawl concurrency and the faster data comes in (a sketch follows this list).
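To illustrate the proxy pool idea, here is a minimal sketch in Go of round-robin proxy rotation for crawler requests; the proxy addresses and the target URL are hypothetical placeholders, not the project's actual nodes.

```go
// A minimal sketch of round-robin proxy rotation for the crawler,
// assuming a pre-collected list of high-anonymity proxy URLs.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"sync/atomic"
	"time"
)

// proxyPool hands out proxies round-robin.
type proxyPool struct {
	proxies []*url.URL
	next    uint64
}

func (p *proxyPool) pick() *url.URL {
	n := atomic.AddUint64(&p.next, 1)
	return p.proxies[n%uint64(len(p.proxies))]
}

func main() {
	pool := &proxyPool{}
	for _, raw := range []string{
		"http://10.0.0.1:8080", // hypothetical high-anonymity nodes
		"http://10.0.0.2:8080",
	} {
		u, _ := url.Parse(raw)
		pool.proxies = append(pool.proxies, u)
	}

	client := &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			// Each request goes out through the next proxy in the pool.
			Proxy: func(*http.Request) (*url.URL, error) {
				return pool.pick(), nil
			},
		},
	}

	resp, err := client.Get("https://example.com/share/xxxx") // placeholder target
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, len(body))
}
```

With more proxies in the pool, more of these requests can run concurrently before any single IP gets throttled.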

Next, let's briefly walk through the two candidate architectures.

Option One

For data durability, this option adds MongoDB as the persistence layer.

  • Data persistence: MongoDB replica set (three nodes)
  • Search engine: Elasticsearch cluster (three nodes)
  • Crawler service cache: Redis caches crawler results, which the Transfer service then syncs into MongoDB (see the sketch after this list).
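As a rough illustration of the Transfer service, the sketch below pops crawler results off a Redis list and inserts them into MongoDB, using go-redis and the official Mongo driver; the list key `crawler:results` and the `pan.files` namespace are assumptions, not the project's actual names.

```go
// A minimal sketch of the Transfer service, assuming crawler workers
// LPUSH JSON results onto a Redis list named "crawler:results".
package main

import (
	"context"
	"encoding/json"
	"log"
	"time"

	"github.com/go-redis/redis/v8"
	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx := context.Background()

	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	mc, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	coll := mc.Database("pan").Collection("files")

	for {
		// Block up to 5s waiting for the next crawler result.
		vals, err := rdb.BLPop(ctx, 5*time.Second, "crawler:results").Result()
		if err == redis.Nil {
			continue // nothing queued yet
		}
		if err != nil {
			log.Println("redis error:", err)
			continue
		}

		// vals[0] is the list key, vals[1] the JSON payload.
		var doc bson.M
		if err := json.Unmarshal([]byte(vals[1]), &doc); err != nil {
			log.Println("bad payload:", err)
			continue
		}
		if _, err := coll.InsertOne(ctx, doc); err != nil {
			log.Println("mongo insert failed:", err)
		}
	}
}
```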

The architecture diagram is as follows:

[Option One architecture diagram]

Evaluation of this option:

  • Data is safer and more durable: once writes complete on MongoDB, Monstache automatically syncs them to the ES cluster (a sample config follows this list).
  • The added MongoDB replica set requires more hardware, so deployment costs are higher.
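For reference, a minimal Monstache configuration for this MongoDB-to-ES sync could look roughly like the following; the connection URLs and the `pan.files` namespace are placeholders.

```toml
# Hypothetical Monstache config: tail the MongoDB replica set and
# mirror the crawler collection into the ES cluster.
mongo-url = "mongodb://mongo1:27017,mongo2:27017,mongo3:27017/?replicaSet=rs0"
elasticsearch-urls = ["http://es1:9200", "http://es2:9200", "http://es3:9200"]

# Collections to watch via change streams (placeholder namespace).
change-stream-namespaces = ["pan.files"]
# One-time full read of existing documents on startup.
direct-read-namespaces = ["pan.files"]

# Persist the resume token so syncing survives restarts.
resume = true
```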

Option Two

To save costs, this option drops the MongoDB cluster and syncs data directly into the ES cluster; in effect, it removes Monstache and the MongoDB replica set.

[Option Two architecture diagram]

This option is simpler and more direct, and needs far fewer hardware resources, which makes it a better fit for a personal project.
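Here is a minimal sketch of what Option Two's write path might look like in Go, indexing crawler results straight into Elasticsearch with the official go-elasticsearch client; the index name `pan-files`, the document shape, and the node addresses are all assumptions.

```go
// A minimal sketch of Option Two's write path: the crawler (or a thin
// transfer layer) indexes results directly into Elasticsearch.
package main

import (
	"context"
	"log"
	"strings"

	"github.com/elastic/go-elasticsearch/v7"
)

func main() {
	es, err := elasticsearch.NewClient(elasticsearch.Config{
		Addresses: []string{"http://es1:9200", "http://es2:9200", "http://es3:9200"},
	})
	if err != nil {
		log.Fatal(err)
	}

	doc := `{"name": "some-shared-file.mp4", "share_url": "https://example.com/s/xxxx"}`

	// Use the share id as the document id so re-crawling the same
	// share updates the record instead of duplicating it.
	res, err := es.Index(
		"pan-files",
		strings.NewReader(doc),
		es.Index.WithDocumentID("xxxx"),
		es.Index.WithContext(context.Background()),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	log.Println("indexed:", res.Status())
}
```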

As for data persistence, you can store it as plain files. For delete operations, only the id of the deleted record needs to be kept; to restore, import all the data again and then remove every record on the invalid-id list.
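A tiny sketch of that file-based scheme: append every record to a data file, and append only the id of each deleted record to a tombstone file. The filenames and record shape here are hypothetical.

```go
// A minimal sketch of file-based persistence, assuming append-only
// files: data.jsonl for records and deleted_ids.txt as a tombstone list.
package main

import (
	"fmt"
	"os"
)

// appendLine appends one line to the given file, creating it if needed.
func appendLine(path, line string) error {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintln(f, line)
	return err
}

func main() {
	// On insert: persist the raw record.
	_ = appendLine("data.jsonl", `{"id":"abc123","name":"file.mp4"}`)

	// On delete: keep only the id of the removed record.
	_ = appendLine("deleted_ids.txt", "abc123")

	// To restore: re-import every line of data.jsonl into ES, then
	// delete every id listed in deleted_ids.txt.
}
```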

Conclusion

Option One is what is currently implemented, but its cluster deployment cost is relatively high. To cope, the MongoDB cluster and crawler-related services run on a personal server at home (no cost), and only the ES cluster and web service sit on cloud servers (burning money). Combined with the admittedly slow Cloudflare CDN, the whole setup runs stably enough.

The web service is currently written in Golang and does not yet expose a RESTful API. I am considering building the web front end in Vue and exchanging data through a RESTful API, although that inevitably raises the risk of the data being scraped. I am still learning, and these immediate problems will be solved in time.
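Purely as an illustration of the RESTful direction being considered, a Go handler like the following could expose search over HTTP by forwarding a keyword query to ES's `_search` endpoint; the route, index name, and ES address are assumptions.

```go
// A hypothetical sketch of a RESTful search endpoint: a Go handler
// that proxies a keyword query to Elasticsearch and returns the JSON.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func searchHandler(w http.ResponseWriter, r *http.Request) {
	q := r.URL.Query().Get("q")
	if q == "" {
		http.Error(w, `{"error":"missing q"}`, http.StatusBadRequest)
		return
	}

	// Forward a simple query-string search to the ES cluster.
	esURL := "http://es1:9200/pan-files/_search?q=" + url.QueryEscape(q)
	resp, err := http.Get(esURL)
	if err != nil {
		http.Error(w, `{"error":"search backend unavailable"}`, http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	w.Header().Set("Content-Type", "application/json")
	io.Copy(w, resp.Body)
}

func main() {
	http.HandleFunc("/api/search", searchHandler)
	fmt.Println("listening on :8080")
	http.ListenAndServe(":8080", nil)
}
```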

If you like this project, please give it a like, a favorite, and a follow.
