Search engine architecture and the evolution of search solutions
1. Macro
- A whole-web search engine has three modules: spider, index, and rank.
- Spider + index: the engineering system. Major search engines such as Baidu and Google are similar here.
- Rank: the business strategy system. This is where search engines differ from one another.
- Rank is the core of the search results.
2. Real-time search engine
Architecture core:
- Index tiering
- dump & merge
Implementation points:
- Real-time fixed-point writes
- Real-time segmented reads
- Asynchronous export and merge
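
One possible reading of the points above, as a minimal sketch: a small index receives the real-time writes, reads consult both the small and the large index, and a merge step folds the small one into the large one. The class name TieredIndex, the two-tier split, and the method names are assumptions made for illustration, and the merge here runs synchronously, whereas the notes call for an asynchronous one.

```python
from collections import defaultdict


class TieredIndex:
    """Toy two-tier inverted index (hypothetical, for illustration only)."""

    def __init__(self):
        self.base = defaultdict(set)   # large tier, changed only by merge()
        self.delta = defaultdict(set)  # small tier, receives real-time writes

    def add(self, url_id, items):
        # Real-time write: only the small tier is touched.
        for item in items:
            self.delta[item].add(url_id)

    def search(self, item):
        # Real-time read: query each tier and combine the results.
        found = self.base.get(item, set()) | self.delta.get(item, set())
        return sorted(found)

    def merge(self):
        # Dump the small tier, then fold it into the large one.
        dumped, self.delta = self.delta, defaultdict(set)
        for item, url_ids in dumped.items():
            self.base[item] |= url_ids
```

A document added with add() is searchable immediately, before any merge has run. Keeping the writable tier small is what keeps writes cheap in this sketch.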
3. Micro
- Forward index: given a url_id, quickly find its list<item>
- Inverted index: given a segmented item, quickly find its list<url_id>
- Retrieval process: segment the query into items, look up the list<url_id> for each item, then intersect the lists
- Intersection of ordered sets:
a. Double for loop, time complexity O(n^2)
b. Zipper method (two-pointer merge), time complexity O(n)
c. Horizontal bucketing, with the buckets processed in parallel by multiple threads
d. Bitmap, which greatly increases computational parallelism, time complexity O(n)
e. Skip list, time complexity O(log(n))
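
The indexes and the retrieval process above can be sketched roughly as follows. This is a minimal illustration, not a production design: the function names are hypothetical, "segmentation" is reduced to splitting on whitespace, and url_ids are assumed to be non-negative integers. It shows the zipper method (b) and a bitmap variant (d) built on Python integers.

```python
from collections import defaultdict


def build_indexes(docs):
    """docs maps url_id -> list of items (already segmented)."""
    forward = dict(docs)          # forward index: url_id -> list<item>
    inverted = defaultdict(list)  # inverted index: item -> sorted list<url_id>
    for url_id in sorted(docs):   # ascending order keeps posting lists sorted
        for item in set(docs[url_id]):
            inverted[item].append(url_id)
    return forward, inverted


def intersect_sorted(a, b):
    """Zipper method: walk two sorted lists with one pointer each."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_bitmap(a, b):
    """Bitmap method: set one bit per url_id, then AND the two bitmaps."""
    bits_a = bits_b = 0
    for x in a:
        bits_a |= 1 << x
    for x in b:
        bits_b |= 1 << x
    both = bits_a & bits_b
    return [x for x in range(both.bit_length()) if (both >> x) & 1]


def search(inverted, query):
    """Segment the query, fetch each posting list, then intersect them."""
    items = query.split()
    if not items:
        return []
    result = inverted.get(items[0], [])
    for item in items[1:]:
        result = intersect_sorted(result, inverted.get(item, []))
    return result
```

With docs = {1: ["real", "time", "search"], 2: ["search", "engine"], 3: ["real", "search", "engine"]}, search(inverted, "search engine") returns [2, 3]. The zipper method only works because both lists are sorted, which is why build_indexes inserts url_ids in ascending order.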
4. Meeting retrieval needs
- Original stage: LIKE
- Primary stage: MySQL full-text index
- Intermediate stage: open-source external indexes such as ES, Solr, Lucene
- Advanced stage: self-developed search engine
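
As a small illustration of the original stage, the sketch below runs a LIKE query. SQLite (from the Python standard library) stands in for MySQL here, and the table name post, its columns, and the helper like_search are all hypothetical. The keyword is not escaped, so a % or _ inside it would act as a wildcard.

```python
import sqlite3

# In-memory database with a few sample rows (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO post (id, title) VALUES (?, ?)",
    [
        (1, "real-time search engine"),
        (2, "index merge"),
        (3, "search ranking"),
    ],
)


def like_search(conn, keyword):
    """Return ids of rows whose title contains keyword, via LIKE '%keyword%'."""
    pattern = f"%{keyword}%"
    rows = conn.execute(
        "SELECT id FROM post WHERE title LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]
```

A pattern with a leading wildcard generally cannot use an ordinary B-tree index, so the query scans the table. That cost is one reason to move on to the later stages as data grows.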
5. Reference
- https://blog.csdn.net/qijiqiguai/article/details/78702506