The differences between Hermes and the open-source Solr and ElasticSearch

When it comes to Hermes' indexing technology, I suspect many readers will immediately think of Solr and ElasticSearch. Both are famous, top-level projects. Recently, people have often asked me: "With Solr and ElasticSearch already available in the open-source world, why use Hermes?"

Before answering, consider another question first: given that databases such as Oracle and MySQL already exist, why do we still use Hive and ES? Oracle and MySQL also have cluster versions and can be distributed, so does that make the emergence of ES and Hive redundant?

Hermes did not emerge to replace Solr and ES, just as ES did not emerge to kill off Oracle and MySQL; each serves needs at a different level.

1. Hermes and Solr/ES have different positioning

Solr/ES lean toward providing full-text retrieval services over small-scale data. Hermes leans toward providing index support for large-scale data warehouses: an ad-hoc analysis solution that lowers warehouse costs. The data volumes Hermes handles are far larger.

The features of Solr and ES are as follows:

1. They derive from search engines and focus on search and full-text retrieval.
2. Data scale ranges from a few million to tens of millions of records; clusters holding over 100 million records are very rare.

PS: Individual systems may hold more than 100 million records, but this is not common (just as an Oracle table can grow larger than a Hive table, but only with a minicomputer behind it).

The usage characteristics of Hermes are as follows:

1. A real-time retrieval and analysis platform for massive data, built on large-index technology, with a focus on data analysis.
2. Data scale ranges from hundreds of millions to trillions of records; even the smallest tables hold tens of millions. On 17 TS5 machines at Tencent, Hermes processes 45 billion records per day (each about 1 KB) and retains the data for a month.

2. Some differences between Hermes and Solr/ES in technical implementation

Problems Solr and ES face with large indexes:

1. The first-level skip list is loaded into memory in its entirety.

Apart from consuming a lot of memory, this approach makes opening an index for the first time particularly slow. Indexes in Solr/ES are kept open all the time rather than opened and closed frequently, and this model restricts the number and size of indexes a single machine can host; usually one machine is permanently dedicated to the indexes of one business.
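To make the pattern concrete, here is a rough sketch of a term index that must be fully resident before any lookup can run. It is illustrative only; the class and field names are invented, not Solr/ES source:

```java
import java.util.Arrays;

// Illustrative sketch: a first-level term index that is loaded into memory
// in its entirety when the index is opened (the pattern criticized above).
public class InMemoryTermIndex {
    private final String[] indexTerms;   // every Nth term, all resident in RAM
    private final long[] blockOffsets;   // on-disk offset of each term block

    // Opening the index forces a full load: with hundreds of millions of
    // terms this is slow and pins a large, permanent chunk of heap.
    public InMemoryTermIndex(String[] indexTerms, long[] blockOffsets) {
        this.indexTerms = indexTerms;
        this.blockOffsets = blockOffsets;
    }

    // Binary-search the in-memory index for the on-disk block that could
    // contain the term; the block itself is then scanned from disk.
    public long locateBlock(String term) {
        int pos = Arrays.binarySearch(indexTerms, term);
        if (pos < 0) pos = Math.max(0, -pos - 2); // block preceding the insertion point
        return blockOffsets[pos];
    }
}
```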

2. Sorting loads all of a column's values into memory.

When sorting or computing statistics (sum, max, min), Solr/ES traverses the inverted index and loads every value of the column into memory, then computes over the in-memory data. Even if a query touches only a single record, the entire column is loaded, which wastes resources and makes the first query painfully slow. Data size is tightly limited by physical memory, and once an index reaches tens of millions of records, OOM becomes a common occurrence.
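A hedged sketch of this "uninvert the whole column on first use" pattern; the names here are invented for illustration, not the actual Solr/ES field cache:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the "load the whole column to use one value" pattern.
public class EagerColumnCache {
    private final Map<String, long[]> cache = new HashMap<>();

    // The first query that sorts or aggregates on a field pays the full cost:
    // every document's value is materialized on the heap, even if the query
    // only ever touches a handful of them.
    public long[] getColumn(String field, int maxDoc, ColumnReader reader) {
        return cache.computeIfAbsent(field, f -> {
            long[] values = new long[maxDoc];       // maxDoc can be 10^8+ -> OOM risk
            for (int doc = 0; doc < maxDoc; doc++) {
                values[doc] = reader.read(f, doc);  // full column scan on first use
            }
            return values;
        });
    }

    public interface ColumnReader {
        long read(String field, int doc);
    }
}
```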

3. Indexes are stored on local disk and are difficult to recover

Once a machine fails, even if no data is lost, an index of several terabytes takes hours just to copy.

4. Cluster scale is limited

They support a master/slave mode, but, like a traditional MySQL database, the cluster does not grow particularly large (under 100 nodes). Beyond the limited cluster size, the data migration required by every expansion is painful, and the migration takes far too long.

5. Data skew problem

With inverted search, even when a certain term is skewed, the entire doc list (say, for "male" or "female") can still be read as long as the data volume is relatively small, and the list occupies a large chunk of memory as cache. At small scale this is acceptable: memory usage stays modest and the high cache hit rate speeds up retrieval. But once the data scale grows, this memory problem becomes more and more serious.
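To make the scale concrete, a quick back-of-the-envelope calculation; the document counts below are hypothetical:

```java
// Back-of-the-envelope cost of caching one skewed term's doc list.
public class DocListCost {
    public static void main(String[] args) {
        long maxDoc = 2_000_000_000L;      // hypothetical: documents in the index
        long hits   = maxDoc / 2;          // a skewed term like "male" matches half

        long asBitSetBytes = maxDoc / 8;   // bitmap: 1 bit per document
        long asIntsBytes   = hits * 4;     // int doc ids: 4 bytes per hit

        System.out.printf("bitset: %d MB, int list: %d MB%n",
                asBitSetBytes >> 20, asIntsBytes >> 20);  // ~238 MB vs ~3814 MB
        // At millions of docs both forms are cheap and cache-friendly; at
        // billions of docs a single cached term costs hundreds of MB to GBs.
    }
}
```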

6. Node count and data scale are limited

There can be only one merger server, which restricts how many nodes a query can fan out to; data cannot be dynamically partitioned, so a single index grows too large as the data scale increases.

7. Under highly concurrent imports, GC eats too much CPU, and multi-threaded throughput does not improve.

AttributeSource uses a WeakHashMap to manage class instantiation and guards it with a global lock: no matter how many threads are added, import performance does not improve.

AttributeSource and NumericField use many LinkedHashMaps and create many useless objects, so every record allocates piles of garbage on the heap; the JVM collects these objects constantly and CPU consumption soars.

The WeakHashMap used by FieldCacheImpl has bugs, and with big data there is a risk of OOM.

In the author's environment, single-machine import performance struggles to exceed 20,000 records per second (for 1 KB records).
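A hedged sketch of the kind of hot-path synchronization described above: a shared, globally locked WeakHashMap through which every import thread must pass (illustrative, not actual Lucene source):

```java
import java.util.Map;
import java.util.WeakHashMap;

// Illustrative sketch: a globally locked WeakHashMap on the hot import path.
// Every indexing thread funnels through the same lock, so adding threads
// adds contention instead of throughput, and the short-lived entries keep
// the garbage collector busy.
public class GlobalAttributeRegistry {
    private static final Map<Class<?>, Object> INSTANCES = new WeakHashMap<>();

    public static synchronized Object instanceFor(Class<?> attClass) throws Exception {
        Object inst = INSTANCES.get(attClass);
        if (inst == null) {
            inst = attClass.getDeclaredConstructor().newInstance();
            INSTANCES.put(attClass, inst);  // weak keys: entries churn under load
        }
        return inst;
    }
}
```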

Solr and ES Summary

None of this means Solr and ES are bad. At small data scale, Solr's approach is superior: concurrency is better and cache utilization is higher. In practice, Solr and ES have proven very stable and performant in production. But with large data volumes and frequent real-time imports, some optimization is required.

Hermes' improvements to indexing:

1. Indexes are loaded on demand

Most indexes stay closed and are opened only when actually used. The first-level skip list is likewise loaded on demand rather than in full, which saves memory and makes indexes faster to open. Hermes dynamically opens different indexes for different businesses and closes the rarely used ones, so the same machine can serve several different businesses and machine utilization stays high.
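A minimal sketch of the on-demand open/close idea, using a simple LRU policy; the class and method names are hypothetical, not Hermes' actual API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch: keep at most maxOpen indexes open at a time; touching a
// new index evicts and closes the least recently used one, so a single
// machine can serve many businesses without keeping every index resident.
public class OnDemandIndexPool {
    public interface Index { void close(); }
    public interface Opener { Index open(String name); }

    private final Map<String, Index> openIndexes;
    private final Opener opener;

    public OnDemandIndexPool(int maxOpen, Opener opener) {
        this.opener = opener;
        // access-order LinkedHashMap gives us LRU eviction for free
        this.openIndexes = new LinkedHashMap<String, Index>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Index> eldest) {
                if (size() > maxOpen) {
                    eldest.getValue().close();  // close the coldest index
                    return true;
                }
                return false;
            }
        };
    }

    // Indexes are opened lazily, only when a query actually needs them.
    public synchronized Index get(String name) {
        return openIndexes.computeIfAbsent(name, opener::open);
    }
}
```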

2. Sorting and statistics are loaded on demand

Sorting and statistics do not use the real values of the data. Labeling technology converts large values into compact data labels that occupy only a fraction (a few tenths or less) of the original memory.

Moreover, a column's values are never loaded wholesale: only the data a query actually touches is loaded, on demand, and unused data is evicted from memory.
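A small sketch of the labeling idea as dictionary encoding: each distinct value gets an ordinal label that preserves value order, so sorting and min/max run on compact integers instead of the raw values. This illustrates the general technique, not Hermes' actual encoding:

```java
import java.util.Arrays;
import java.util.TreeSet;

// Illustrative label (dictionary) encoding: each distinct value gets an
// ordinal label, and labels preserve value order, so sort/min/max can run
// on compact ints instead of the full-size original values.
public class LabelEncoder {
    public static void main(String[] args) {
        String[] column = {"guangzhou", "beijing", "shenzhen", "beijing", "guangzhou"};

        // 1. Build the ordered dictionary of distinct values.
        String[] dict = new TreeSet<>(Arrays.asList(column)).toArray(new String[0]);

        // 2. Replace every value with its label; an int (or a short/byte when
        //    the dictionary is small) replaces a whole string in memory.
        int[] labels = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            labels[i] = Arrays.binarySearch(dict, column[i]);
        }

        // 3. Statistics touch only the labels; the dictionary is consulted
        //    once at the end to decode the answer.
        int maxLabel = Arrays.stream(labels).max().getAsInt();
        System.out.println("max value = " + dict[maxLabel]);  // shenzhen
    }
}
```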

3. Indexes are stored in HDFS

In theory, as long as HDFS has space, indexes can keep growing. Index size is no longer severely limited by a machine's physical memory and disks, and disaster recovery and data migration become much easier.
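A minimal sketch of publishing a locally built index segment to HDFS using the standard Hadoop FileSystem API; the cluster address and paths are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: publish a locally built index segment to HDFS. Once the segment
// lives in HDFS, replication handles disaster recovery, and any node can
// open it, so migration becomes a metadata change rather than a bulk copy.
public class IndexPublisher {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");  // hypothetical cluster

        try (FileSystem fs = FileSystem.get(conf)) {
            Path local  = new Path("/data/build/segment_0042");          // hypothetical paths
            Path remote = new Path("/hermes/indexes/logs/segment_0042");
            fs.copyFromLocalFile(local, remote);
        }
    }
}
```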

4. Use Gaia for process management (Tencent's version of Yarn)

With the data in HDFS, sizing and expanding the cluster become easy (Gaia's cluster scale at Tencent has reached 10,000 machines).

5. Use multi-condition combined skipping to reduce data skew

If one term is skewed, it is merged with the other query conditions by skipping (refer to the skip-list structure of the doc list).
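A hedged illustration of the combined-jump idea: intersect the skewed term's doc list with a more selective condition by leapfrogging over both sorted lists, so most of the skewed list is skipped rather than read (illustrative code, not Hermes' implementation):

```java
import java.util.Arrays;

// Illustrative "combined jump": instead of materializing the huge doc list
// of a skewed term (e.g. gender=male), intersect it with a selective
// condition by leapfrogging: each sorted list jumps forward to the other's
// current candidate, so the skewed list is mostly skipped, not read.
public class LeapfrogIntersect {
    public static void intersect(int[] skewed, int[] selective) {
        int i = 0, j = 0;
        while (i < skewed.length && j < selective.length) {
            if (skewed[i] == selective[j]) {
                System.out.println("match doc " + skewed[i]);
                i++; j++;
            } else if (skewed[i] < selective[j]) {
                // jump ahead in the skewed list via binary search
                int pos = Arrays.binarySearch(skewed, i + 1, skewed.length, selective[j]);
                i = pos >= 0 ? pos : -pos - 1;
            } else {
                j++;
            }
        }
    }

    public static void main(String[] args) {
        int[] skewed    = {1, 2, 3, 5, 8, 13, 21, 34, 55, 89};
        int[] selective = {5, 34, 90};
        intersect(skewed, selective);  // prints doc 5 and doc 34
    }
}
```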

6. Multi-level mergers and custom partitioning
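Assuming this means that intermediate mergers combine shard results before a root merger sees them, a schematic sketch (structure and names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Schematic multi-level merge: leaf mergers each combine a subset of shard
// results, and the root merges only the leaf outputs, so no single merger
// server has to fan in to every node in the cluster.
public class MultiLevelMerger {
    // k-way merge of descending-sorted score lists into one top-k list
    static List<Double> mergeTopK(List<List<Double>> partials, int k) {
        PriorityQueue<Double> heap = new PriorityQueue<>((a, b) -> Double.compare(b, a));
        for (List<Double> p : partials) heap.addAll(p);
        List<Double> top = new ArrayList<>();
        for (int i = 0; i < k && !heap.isEmpty(); i++) top.add(heap.poll());
        return top;
    }

    public static void main(String[] args) {
        List<List<Double>> shardA = List.of(List.of(9.1, 3.2), List.of(8.5, 1.0));
        List<List<Double>> shardB = List.of(List.of(7.7, 6.4), List.of(9.9, 0.5));

        // Level 1: two intermediate mergers, each covering half the shards.
        List<Double> leaf1 = mergeTopK(shardA, 3);
        List<Double> leaf2 = mergeTopK(shardB, 3);

        // Level 2: the root merges only the two leaf results.
        System.out.println(mergeTopK(List.of(leaf1, leaf2), 3));  // [9.9, 9.1, 8.5]
    }
}
```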

7. Some optimizations for GC

Memory is self-managed: in key places the creation and release of memory objects is controlled by the application rather than left to Java, reducing GC pressure (similar to HBase's block cache). WeakHashMap and global locks are avoided, since a misused WeakHashMap easily leaks memory and performs poorly. Objects used in word segmentation are shared, reducing repeated creation and release. For 1 KB records, in the author's environment, one machine processes 40,000 to 80,000 records per second.
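A minimal sketch of the reuse pattern, assuming per-thread reusable buffers on the import path in place of per-record allocation (illustrative, not Hermes' code):

```java
// Minimal sketch of the reuse pattern: per-thread reusable buffers replace
// per-record allocations on the import path, so the young generation stops
// churning and no global lock is needed (each thread owns its own buffer).
public class ReusableRecordBuffer {
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(() -> new StringBuilder(1024));

    // Called once per imported record; the builder is cleared and reused
    // instead of being created and thrown away each time.
    public static String normalize(String rawRecord) {
        StringBuilder sb = BUFFER.get();
        sb.setLength(0);                  // reset, don't reallocate
        for (int i = 0; i < rawRecord.length(); i++) {
            char c = rawRecord.charAt(i);
            sb.append(Character.isWhitespace(c) ? ' ' : c);
        }
        return sb.toString();
    }
}
```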
