Big Data Ecosystem: Hadoop

Disclaimer: This is an original article by the blogger, released under the CC 4.0 BY-SA license. Please include the original source link and this statement when reposting.
Original link: https://blog.csdn.net/qq_39530692/article/details/85008127

First, let's explain what Hadoop is.

Hadoop is an open-source framework from the Apache Software Foundation for storing and processing very large data sets in a distributed fashion across clusters of commodity machines.

Hadoop has two core components. The first is HDFS (Hadoop Distributed File System), a distributed file storage system.

The second is MapReduce, a distributed computing framework for offline (batch) processing.

Together, these two components solve the two fundamental problems of big data: storing it and computing over it.
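
To make the storage side concrete, here is a minimal sketch of writing a file through the HDFS Java client. The NameNode address and file path are placeholders, not values from the original article.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; in a real deployment this comes from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://localhost:9000");
    try (FileSystem fs = FileSystem.get(conf);
         FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"))) {
      // The NameNode tracks the file's metadata; the bytes themselves are
      // split into blocks and replicated across DataNodes.
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}
```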

The remaining components of the ecosystem are, by and large, tools derived from and built on top of these two.

MapReduce programming interfaces:

1. Java (the native and most low-level way; see the word-count sketch after this list)

2. Hadoop Streaming (supports multiple languages via standard input and output)

3. Hadoop Pipes (for C and C++)
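
As an illustration of the native Java API (option 1 above), here is the classic word-count job in roughly its standard form: the mapper emits (word, 1) pairs and the reducer sums them per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```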

Mahout: provides machine-learning algorithms for classification, clustering, frequent pattern mining, vector similarity computation, recommendation engines, dimensionality reduction, evolutionary algorithms, regression analysis, and more.

Hive: a data warehouse built on top of Hadoop for statistical analysis of massive structured log data. Its query language, HQL, is similar to SQL but not identical.
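
As a minimal sketch of what querying Hive looks like from Java, the following connects to a HiveServer2 instance over JDBC. The endpoint, credentials, and the page_logs table are all hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Load the HiveServer2 JDBC driver (hive-jdbc must be on the classpath).
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Placeholder endpoint: HiveServer2's default port is 10000.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement();
         // HQL reads like SQL: daily page views from a hypothetical log table.
         ResultSet rs = stmt.executeQuery(
             "SELECT dt, COUNT(*) AS pv FROM page_logs GROUP BY dt")) {
      while (rs.next()) {
        System.out.println(rs.getString("dt") + "\t" + rs.getLong("pv"));
      }
    }
  }
}
```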

Pig: a Hadoop-based dataflow execution engine that uses MapReduce to process data in parallel; dataflows are written in the Pig Latin language.
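
Pig Latin scripts are usually run from the pig shell, but they can also be driven from Java through PigServer. The following sketch is illustrative only: the input file and field layout are assumptions, and local mode is used so no cluster is needed.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddedExample {
  public static void main(String[] args) throws Exception {
    // LOCAL mode runs the dataflow on one machine; MAPREDUCE mode targets a cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);
    // Hypothetical input: tab-separated (user, url) records in access.log.
    pig.registerQuery("logs = LOAD 'access.log' AS (user:chararray, url:chararray);");
    pig.registerQuery("by_user = GROUP logs BY user;");
    pig.registerQuery("counts = FOREACH by_user GENERATE group, COUNT(logs);");
    // Storing an alias triggers compilation of the dataflow into parallel jobs.
    pig.store("counts", "counts_out");
    pig.shutdown();
  }
}
```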

Hive2 (Stinger): the next generation of Hive, which replaces MapReduce with Tez (a DAG computation framework) as the underlying execution engine.
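
The engine is selected per session through the hive.execution.engine property. A small sketch, reusing the JDBC approach from the Hive example above (same placeholder endpoint and table) and assuming Tez is installed on the cluster:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveOnTezExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://localhost:10000/default"; // placeholder endpoint
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement()) {
      // Switch this session from MapReduce to Tez.
      stmt.execute("SET hive.execution.engine=tez");
      // The same HQL now compiles to a single Tez DAG instead of chained MapReduce jobs.
      stmt.execute("SELECT dt, COUNT(*) FROM page_logs GROUP BY dt");
    }
  }
}
```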

Impala: can process data stored directly on HDFS without writing intermediate results back to HDFS multiple times; it has good scalability and fault tolerance and is designed for fast interactive queries.

Oozie: a workflow scheduler for Hadoop that chains dependent jobs (MapReduce, Hive, Pig, and others) into workflows defined in XML.
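
As a sketch of how a client submits a workflow, the following uses the Oozie Java client. The server URL, HDFS paths, and properties are placeholders, and a workflow.xml is assumed to already exist at the application path.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;

public class OozieSubmitExample {
  public static void main(String[] args) throws Exception {
    // Placeholder server URL: 11000 is Oozie's default port.
    OozieClient client = new OozieClient("http://localhost:11000/oozie");
    Properties conf = client.createConfiguration();
    // Hypothetical HDFS directory containing a workflow.xml definition.
    conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:9000/user/hadoop/wf-app");
    conf.setProperty("nameNode", "hdfs://localhost:9000"); // referenced by the workflow
    conf.setProperty("jobTracker", "localhost:8032");      // resource manager address
    String jobId = client.run(conf); // submit and start the workflow
    System.out.println("Workflow job submitted: " + jobId);
  }
}
```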
