Hadoop Ecosystem: Product Analysis of Computing Engines and Tools

Modes of big data computing, from batch to stream:

1. Batch computing (bounded, static data sets with a defined start and end)

2. Stream computing (unbounded, continuously arriving data)

3. Interactive querying

4. Graph computing
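
The difference between modes 1 and 2 above can be sketched in a few lines of plain Python (a hypothetical illustration, not tied to any particular engine): batch mode sees the whole bounded data set and produces one final result, while stream mode consumes an unbounded source and emits an updated result per event.

```python
def batch_sum(records):
    """Batch: the full, bounded data set is available before computation starts."""
    return sum(records)

def stream_sum(record_stream):
    """Stream: consume an unbounded source, emitting a running result per event."""
    total = 0
    for record in record_stream:
        total += record
        yield total  # result is updated continuously, not only at the end

data = [3, 1, 4, 1, 5]
print(batch_sum(data))               # one final answer: 14
print(list(stream_sum(iter(data))))  # incremental answers: [3, 4, 8, 9, 14]
```

In a real stream the input generator would never terminate, which is exactly why stream engines must emit incremental results.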

The Hadoop ecosystem for big data analysis:

1. Hadoop core: a data analysis framework based on HDFS (distributed file storage) + MapReduce (key-value-based computation). Its historical roots are Google's "three carriages" papers: GFS, BigTable, and MapReduce.

2. YARN (resource management framework)

3. Sqoop (database migration tool)

4. Mahout (data mining algorithm library)

5. HBase (distributed storage system)

6. ZooKeeper (distributed coordination service)

7. Hive (data warehouse tool)

8. Flume (log collection tool)

9. Spark (general-purpose computing engine)

10. Impala (interactive query engine)

11. Kafka (distributed message queue)

12. Ambari (big data cluster management)

13. Oozie (workflow scheduling)
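
The key-value model behind item 1 can be simulated in-process (an illustrative sketch only; real Hadoop jobs implement Mapper/Reducer classes in Java and shuffle data across the cluster):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) key-value pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data", "big hadoop"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 1, 'hadoop': 1}
```

This is the classic word count: map emits key-value pairs, shuffle groups them by key, reduce aggregates each group.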

Comparative analysis of big data processing capabilities:

Hive is a Java-based data warehouse tool that exposes JDBC/ODBC interfaces and a web GUI, translating SQL-like queries into MapReduce jobs.

Spark is a hybrid (batch + stream) framework that provides an interactive programming experience and optimizes the MapReduce computing model with in-memory execution. Its scalability and stability have drawn criticism, however, and since it still sits on HDFS + YARN it is not a turnkey commercial product. Spark Streaming uses pipelined micro-batch processing: high throughput, but comparatively high latency, at the second level.
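
Micro-batching, the source of that second-level latency, can be sketched as follows (a hypothetical simulation: real Spark Streaming cuts batches by a time interval; here we cut by a fixed batch size for simplicity):

```python
def micro_batches(events, batch_size):
    """Cut the incoming stream into small fixed-size batches."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch  # results appear only at batch boundaries,
            batch = []   # which is where the seconds-level latency comes from
    if batch:
        yield batch

def process(events, batch_size):
    """Run a batch computation (here: a count) over each micro-batch."""
    return [len(batch) for batch in micro_batches(events, batch_size)]

print(process(range(7), batch_size=3))  # [3, 3, 1]
```

Each micro-batch is handled by ordinary batch machinery, which is why throughput is high while per-event latency is bounded below by the batch interval.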

Impala, which behaves much like an RDBMS, is a fast SQL query tool that bypasses MapReduce entirely; its queries run an order of magnitude faster than MapReduce-based queries.

Trino (formerly Presto) is a distributed SQL query engine built for high-speed, interactive data queries. Features: support for multiple data sources, with heterogeneous/federated queries across them. Presto was created because Hive's MapReduce model was too slow and could not expose HDFS data to BI and similar tools interactively. Performance optimizations: Presto/Trino supports in-memory parallel processing, pipelined execution across cluster nodes, a multi-threaded execution model, efficient flat in-memory data structures (minimizing Java garbage collection), and Java bytecode generation. It has outperformed both Impala and Spark SQL.
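
The pipelined execution mentioned above can be sketched with generators standing in for query operators (a hypothetical illustration, not Trino's actual Java operator API): rows flow through scan, filter, and aggregate stages without materializing intermediate results.

```python
def scan(table):
    """Source operator: stream rows out one at a time."""
    yield from table

def filter_op(rows, predicate):
    """Filter operator: pass matching rows straight downstream."""
    return (row for row in rows if predicate(row))

def count_agg(rows):
    """Aggregate operator: drain the pipeline and produce one value."""
    return sum(1 for _ in rows)

table = [{"city": "NY", "amount": 10},
         {"city": "SF", "amount": 25},
         {"city": "NY", "amount": 7}]

# Roughly: SELECT count(*) FROM table WHERE city = 'NY'
pipeline = filter_op(scan(table), lambda r: r["city"] == "NY")
print(count_agg(pipeline))  # 2
```

Because each operator pulls rows on demand, no stage ever holds the full intermediate result in memory, in contrast to MapReduce's materialize-between-stages model.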

HBase is suitable for processing billions to tens of billions of rows; the underlying HDFS cluster should have at least 5 nodes.

Flink is a hybrid framework that consumes events from message queues for event processing, real-time computation, and stream-batch integration. It is lightweight and fault-tolerant, with high throughput and millisecond-level low latency.
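
Flink's per-event (true streaming) model contrasts with Spark's micro-batches: each event updates keyed state and emits a result immediately, which is where the millisecond latency comes from. A hypothetical simulation (real Flink jobs use the DataStream API with managed state):

```python
def keyed_running_count(events):
    """Maintain a per-key count, emitting (key, count) after every single event."""
    state = {}  # stands in for Flink's checkpointed keyed state
    for key in events:
        state[key] = state.get(key, 0) + 1
        yield (key, state[key])  # emitted per event, not per batch boundary

events = ["click", "view", "click", "click"]
print(list(keyed_running_count(events)))
# [('click', 1), ('view', 1), ('click', 2), ('click', 3)]
```

Fault tolerance in the real system comes from periodically checkpointing that keyed state, so a restarted job resumes counting instead of starting over.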


Origin blog.csdn.net/weixin_29403917/article/details/128113823