Elasticsearch vs Hadoop comparison

In log analysis, the open source search engine Elasticsearch has become increasingly popular over the past few years. Together with the open source server-side log collector Logstash and the open source visualization tool Kibana, it forms the ELK stack, a combination that is gaining momentum.

 

Elasticsearch is a distributed search server built on Lucene. It stores documents as JSON and exposes a RESTful interface, which makes it easy to add search to any web application. It also has excellent aggregation support, so statistical analysis of the data is straightforward. In that sense Elasticsearch has outgrown its original role as a pure search engine. But if it is used as a tool for complex data analysis, can it beat Hadoop or Spark?
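As a rough illustration of the JSON query style and the aggregation feature, the sketch below builds a search body that counts 404 responses per URL. The field names `status` and `url.keyword`, the aggregation name, and the helper function are assumptions made up for this example; they are not taken from any real index.

```python
import json


def build_404_report_query(top_n=5):
    """Build a search body that counts 404 responses per URL.

    The field names `status` and `url.keyword` are hypothetical.
    """
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "query": {"term": {"status": 404}},
        "aggs": {
            "top_missing_pages": {
                "terms": {"field": "url.keyword", "size": top_n}
            }
        },
    }


body = build_404_report_query()
print(json.dumps(body, indent=2))
```

A body like this would be sent with an HTTP POST to the `_search` endpoint of an index. The whole query is plain JSON, with no job code to compile or deploy.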

 

Why Elasticsearch is Popular

 

1. Elasticsearch cluster instances are easy to set up.

2. The JSON-based query language is easier to master than writing MapReduce or Spark programs.

3. Developers can easily integrate Elasticsearch into Hadoop.

 

These are compelling features, and Elasticsearch makes it possible to build an analysis system quickly. But does that make Elasticsearch a highly available data analysis platform? A mature, highly available data analysis platform needs two things: a highly available data storage system and a computing framework that can handle complex data processing.

 

For distributed data storage, data consistency in the Elasticsearch cluster is one of our concerns. Under normal circumstances, all nodes in the cluster agree on which node is the master, and the cluster state they hold is consistent as well. However, every node in an Elasticsearch cluster maintains cluster state, so when the network inside the cluster is unstable, a split brain can occur: different nodes disagree about which node is the master. An Elasticsearch cluster in a normal environment is shown in the figure.

 

 

When a network fault occurs and the master node becomes unreachable, different nodes may elect different masters.
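The failure mode can be sketched with a toy simulation. The election rule used here (the lowest node id wins) and the function name are simplifications invented for illustration, not Elasticsearch's actual algorithm. The sketch also shows the usual mitigation, which is to refuse to elect a master unless a majority of the cluster is visible. Elasticsearch versions before 7.0 exposed a setting of this kind, `discovery.zen.minimum_master_nodes`.

```python
def elect_masters(partitions, cluster_size, require_majority):
    """Return the master elected in each network partition (or None).

    Each partition is a list of node ids that can still reach each other.
    The election rule (lowest id wins) is a simplification for illustration.
    """
    quorum = cluster_size // 2 + 1
    masters = []
    for nodes in partitions:
        if require_majority and len(nodes) < quorum:
            masters.append(None)  # too few nodes visible: refuse to elect
        else:
            masters.append(min(nodes))
    return masters


# A 5-node cluster split into two groups by a network fault.
partitions = [[1, 2], [3, 4, 5]]

without_quorum = elect_masters(partitions, cluster_size=5, require_majority=False)
with_quorum = elect_masters(partitions, cluster_size=5, require_majority=True)

print(without_quorum)  # [1, 3]: two masters, i.e. split brain
print(with_quorum)     # [None, 3]: only the majority side elects a master
```

Without the majority requirement, each side of the partition elects its own master and the two halves can accept conflicting writes.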

 

 

This means that if we want to guarantee the consistency and integrity of the data, we must store the data in a more reliable database.

 

Unlike in Elasticsearch, split-brain conditions are effectively avoided in Hadoop, as shown below.

 

 

The primary namenode maintains the state of the datanodes, and the primary and standby namenodes synchronize their information. This guarantees that only one primary namenode manages the datanode state in the cluster at any time.
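One common way to enforce a single writer is to fence stale nodes with an increasing epoch number. The sketch below is a toy model of that idea, under the assumption that both namenodes write to a shared edit log. The class and method names are hypothetical, and this is not Hadoop's actual code.

```python
class SharedEditLog:
    """Toy shared log that accepts writes only from the newest epoch."""

    def __init__(self):
        self.highest_epoch = 0
        self.entries = []

    def claim_active(self):
        """Called by a namenode taking over; returns its new epoch."""
        self.highest_epoch += 1
        return self.highest_epoch

    def append(self, epoch, entry):
        """Reject writers whose epoch is stale (a fenced former primary)."""
        if epoch < self.highest_epoch:
            return False
        self.entries.append(entry)
        return True


log = SharedEditLog()
old_primary = log.claim_active()              # epoch 1
log.append(old_primary, "mkdir /logs")        # accepted

new_primary = log.claim_active()              # failover: epoch 2
print(log.append(old_primary, "rm /logs"))    # False: old primary is fenced
print(log.append(new_primary, "mkdir /tmp"))  # True
```

Even if the old primary is still running after a failover, its writes are rejected, so at most one node can change the cluster's metadata at a time.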

 

Elasticsearch has powerful aggregation and full-text search features that are well suited to analyzing web traffic problems: 404 error counts, page views, user access statistics, and so on. But it lacks SQL-style features such as joins and subqueries. It does not support further processing of query results or output of intermediate data for later analysis, and it does not support dataset transformations (for example, turning a table of 1 million rows into another table of 1 million rows through analytical processing). It is therefore not well suited to complex computational logic.

 

By contrast, Hadoop's MapReduce or the Spark computing framework can handle arbitrary data aggregation and transformation work, and Hive or Spark SQL can lower the development effort.
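To make the contrast concrete, here is a plain-Python sketch of the kind of work meant here: a join against a lookup table, a row-for-row transformation, and a map/reduce-style aggregation. It uses no Hadoop or Spark API. The sample data and function names are invented for illustration, and a real job would distribute these phases across a cluster.

```python
from collections import defaultdict

# Hypothetical sample data: access-log rows and a user lookup table.
access_log = [
    {"user_id": 1, "url": "/home", "status": 200},
    {"user_id": 2, "url": "/missing", "status": 404},
    {"user_id": 1, "url": "/missing", "status": 404},
]
users = {1: "alice", 2: "bob"}


def map_phase(rows):
    """Join each log row with the user table and emit (key, value) pairs."""
    for row in rows:
        name = users.get(row["user_id"], "unknown")
        yield name, 1 if row["status"] == 404 else 0


def reduce_phase(pairs):
    """Sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Row-for-row transformation: every input row yields one enriched output row.
enriched = [
    {**row, "user": users.get(row["user_id"], "unknown")} for row in access_log
]

errors_per_user = reduce_phase(map_phase(access_log))
print(errors_per_user)  # {'alice': 1, 'bob': 1}
```

The `enriched` list is an intermediate dataset with the same number of rows as the input, which is the kind of output the previous paragraph says Elasticsearch does not produce.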

 

Despite these problems, Elasticsearch is still a very good distributed computing framework, and it can be easily integrated with Hadoop. We can also use its excellent data retrieval capabilities to build our own query system. Elasticsearch is still being iterated on rapidly, and I believe future versions will solve these problems step by step.

