Detailed comparison of MySQL, HBase, and ElasticSearch

1. Concept introduction

MySQL: a relational database used mainly for OLTP (Online Transaction Processing). It supports transactions, secondary indexes, and SQL. It can be deployed with master-slave replication or with Group Replication (MGR), a newer high-availability, highly scalable solution in which every node in the cluster holds the same data and, in multi-primary mode, any node can accept writes, which makes it multi-master in the true sense. This article uses InnoDB as the example and does not cover other storage engines.

HBase: a column-oriented distributed NoSQL database built on HDFS. It supports reading and writing massive amounts of data (writing in particular) and tables with hundreds of millions of rows and millions of columns. It is distributed by design and uses a master-slave architecture. It does not support transactions (beyond single-row atomicity), and it does not natively support secondary indexes or SQL.

ElasticSearch: ES for short, a distributed full-text search engine built on Lucene. Although ES also provides storage and retrieval, I have never regarded it as a database; still, as ES grows more and more powerful, the line between it and a database is getting blurred. It is distributed with a P2P architecture, does not support transactions, and uses an inverted index to provide full-text search.

2. Data storage method

Suppose there is such a personnel information table:
[Figure: personnel information table]
MySQL requires the table structure to be defined in advance: the number of columns (attributes) in the table and the storage space each column occupies must both be declared up front. Data is organized in units of rows. Even if a column of a row holds no data, storage space is still reserved for it.

HBase stores data by column, and each column value is a key-value pair. The columns (attributes) of an HBase table do not need to be defined in advance and can be extended dynamically. For example, to add a new "address" field to the personnel information table, MySQL must first add the column with ALTER TABLE, whereas HBase can simply insert the value.
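The contrast can be sketched in Python. This is only an illustration under several assumptions: the standard-library sqlite3 module stands in for MySQL (the ALTER TABLE statement has the same shape), a plain dict stands in for an HBase table, and the table name, column names, and sample people are hypothetical. In real HBase the column family (here `info`) is declared when the table is created; only the column names inside a family are free-form.

```python
import sqlite3

# Relational side: sqlite3 is used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, sex TEXT)")
conn.execute("INSERT INTO person (id, name, sex) VALUES (1, 'Tom', 'male')")

# Writing to a column that was never declared is rejected ...
try:
    conn.execute("INSERT INTO person (id, name, address) VALUES (2, 'Ann', 'Beijing')")
    insert_failed = False
except sqlite3.OperationalError:
    insert_failed = True

# ... until the schema is changed first.
conn.execute("ALTER TABLE person ADD COLUMN address TEXT")
conn.execute("INSERT INTO person (id, name, address) VALUES (2, 'Ann', 'Beijing')")
rows = conn.execute(
    "SELECT id, name, sex, address FROM person ORDER BY id"
).fetchall()

# HBase-like side: a toy table keyed by (row key, "family:column").
hbase_like = {}
hbase_like[("row1", "info:name")] = "Tom"
hbase_like[("row1", "info:sex")] = "male"
hbase_like[("row2", "info:name")] = "Ann"
hbase_like[("row2", "info:address")] = "Beijing"  # new column, no schema change
```

The relational insert only succeeds after the schema change, while the key-value side accepts the new column as just another key.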

ES is more flexible still. The field types of an index can be defined in advance (by defining a mapping) or left undefined, in which case ES assigns a default type. For the sake of controllability, however, it is recommended to define the key fields in advance. (In Solr, by contrast, the schema.xml file must be defined in advance.)
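As a rough illustration of defining key fields in advance, the request body for creating an index with an explicit mapping might look like the following, written here as a Python dict and serialized to JSON. The index fields (`name`, `sex`, `age`, `resume`) are hypothetical, the layout follows the typeless mapping form used by ES 7 and later, and actually sending the body to a cluster is left out.

```python
import json

# Hypothetical mapping for a "personnel" index: key fields are typed up front.
# Any field not listed here would fall back to ES's default (dynamic) typing.
personnel_mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "keyword"},    # exact-match string
            "sex": {"type": "keyword"},
            "age": {"type": "integer"},
            "resume": {"type": "text"},     # analyzed for full-text search
        }
    }
}

# JSON body that would accompany the index-creation request.
request_body = json.dumps(personnel_mapping, indent=2)
```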
[Figure: how MySQL and HBase store the personnel information table]
The figure above shows the difference between data storage in MySQL and HBase (it is simplified compared with the real layout). Even though the sex field of the second record is empty, MySQL still reserves space for it, because a later UPDATE statement may fill in the sex value. HBase, on the other hand, treats each column value as its own record: the row key plus the column name is the key, the data is the value, and the records are stored in sequence. If a column of a certain row has no data, that column is simply skipped. For large, sparse tables, HBase can save a great deal of storage space.
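The two layouts can be mimicked with a small Python sketch. It is a simplified model, not the actual on-disk format of either system, and the sample rows and column names are hypothetical.

```python
# Hypothetical personnel rows; None marks a missing value.
columns = ["name", "sex", "age"]
people = {
    "row1": {"name": "Tom", "sex": "male", "age": 25},
    "row2": {"name": "Ann", "sex": None, "age": 30},
}

def row_layout(people, columns):
    """Row-oriented: one slot per declared column, even when it is empty."""
    return [tuple(rec.get(col) for col in columns)
            for _, rec in sorted(people.items())]

def cell_layout(people):
    """HBase-like: one ((row key, column), value) cell per existing value,
    kept sorted by key; empty columns produce no cell at all."""
    cells = [((key, col), val)
             for key, rec in people.items()
             for col, val in rec.items()
             if val is not None]
    return sorted(cells, key=lambda cell: cell[0])

row_store = row_layout(people, columns)
cell_store = cell_layout(people)

row_slots = sum(len(r) for r in row_store)   # slots used by the row layout
cell_count = len(cell_store)                 # cells used by the key-value layout
```

With many columns and mostly empty rows, the gap between `row_slots` and `cell_count` grows quickly, which is the saving the text describes.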

At this point you may have a question: with HBase storage, if the sex value of the second row needs to be added later, how is that done, and is the data still stored contiguously? The read and write process will be explained later.

3. How ES is different

ES stores data differently from both of the above. MySQL and HBase lay data out in different ways, but both still store the data itself, whereas ES stores inverted indexes. Let's first look at what an inverted index is and why it is needed:

We have all had this experience: we come across a good piece of text but do not know its source. Going to the library and checking books one by one would be like looking for a needle in a haystack. Full-text search solves this problem, and its core is the inverted index. Suppose we have the following documents:

[Figure: sample documents]

We want to know which documents contain the keyword "you". First, we can build an inverted index in the following format:

[Figure: inverted index built from the sample documents]
The first part is called the dictionary, and each word in it is called a term. The document list that follows each term is called the posting list; it records the ids of all documents that contain the term. Together the two make up a complete inverted index. As you can see, to search for documents containing "you", you look the term up in the dictionary and read off the corresponding posting list.
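The dictionary and posting-list structure described here can be sketched in a few lines of Python. The three documents are hypothetical stand-ins for the ones in the figure, and splitting on whitespace is a deliberate simplification of real word segmentation.

```python
from collections import defaultdict

# Hypothetical documents, keyed by document id.
docs = {
    1: "how are you",
    2: "you and me",
    3: "nice to meet them",
}

def build_inverted_index(docs):
    """Map each term (dictionary entry) to a sorted posting list of doc ids."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() so a term repeated inside one document is recorded once
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)

index = build_inverted_index(docs)

# Searching for "you" is a dictionary lookup followed by reading the posting list.
hits = index.get("you", [])
```

Because the lookup goes from term to documents, finding every document that contains "you" does not require scanning the documents themselves.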

In full-text search, building the inverted index is the most critical and time-consuming step, and a real inverted index is far more complicated than the one shown in the figure. The documents have to be segmented into words (for Chinese, ES lets you plug in a custom word segmenter), a relevance measure such as TF-IDF has to be computed so that results can be scored and sorted (when searching for "you", the score decides which doc is shown first, which is the so-called search ranking), and the index has to be compressed, among other operations. Every time ES receives a document, it updates that document's information in the inverted index.
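To make the scoring step concrete, here is a minimal sketch of textbook TF-IDF ranking in Python. It is not the scoring code ES or Lucene actually runs (recent ES versions default to BM25, a refinement of the same idea), and the documents are hypothetical.

```python
import math

# Hypothetical documents, keyed by document id.
docs = {
    1: "you you and me",
    2: "how are you",
    3: "nice to meet them",
}

def tf_idf(term, doc_id, docs):
    """Textbook TF-IDF: term frequency in the doc times log inverse doc frequency."""
    words = docs[doc_id].split()
    tf = words.count(term) / len(words)
    # number of documents that contain the term at least once
    df = sum(1 for text in docs.values() if term in text.split())
    if df == 0:
        return 0.0
    idf = math.log(len(docs) / df)
    return tf * idf

# Rank all documents for the query term "you", highest score first.
ranking = sorted(docs, key=lambda d: tf_idf("you", d, docs), reverse=True)
```

Document 1 ranks first because "you" makes up a larger share of its words, and document 3 scores zero because it does not contain the term.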

As you can see, storage in ES is very different from storage in MySQL and HBase. Moreover, ES does not only keep the inverted index: by default it also stores the original document, so when we use ES we can retrieve the complete document, which to some extent makes it feel like using a database. ES can also be configured not to store the document. In that case a query returns only the ids of the matching documents, and the complete document content cannot be retrieved.
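The difference between the two configurations can be illustrated with a toy index in Python. The class and its `store_source` flag are hypothetical; in ES the corresponding setting is the `_source` field of the mapping, which is enabled by default.

```python
class ToyIndex:
    """Toy inverted index that optionally keeps the original documents."""

    def __init__(self, store_source=True):
        self.store_source = store_source
        self.postings = {}   # term -> set of doc ids
        self.sources = {}    # doc id -> original text (only if stored)

    def add(self, doc_id, text):
        for term in set(text.lower().split()):
            self.postings.setdefault(term, set()).add(doc_id)
        if self.store_source:
            self.sources[doc_id] = text

    def search(self, term):
        """Return (doc id, original text or None) for every matching document."""
        ids = sorted(self.postings.get(term.lower(), ()))
        return [(doc_id, self.sources.get(doc_id)) for doc_id in ids]

with_source = ToyIndex(store_source=True)
without_source = ToyIndex(store_source=False)
for idx in (with_source, without_source):
    idx.add(1, "how are you")
    idx.add(2, "nice to meet them")

full_hits = with_source.search("you")      # ids plus document content
id_only_hits = without_source.search("you")  # ids only, content unavailable
```

Both indexes find the same document, but only the one that stored the source can hand back its content.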

Summary:

MySQL: row storage is better suited to OLTP workloads.

HBase: column storage is better suited to OLAP workloads, and HBase uses column families to strike a balance between OLTP and OLAP. It also supports horizontal scaling. If the data volume is large, the performance requirements are not too strict, and transactions are not required, HBase is worth considering.

ES: by default ES indexes every field, so it is better suited to complex queries and full-text search.

Origin blog.csdn.net/chuige2013/article/details/129463877