Big Data Frameworks: The Hadoop Ecosystem

1.2 Hadoop ecosystem

Architects and developers usually describe a software tool by the single purpose it serves. For example, they might say that Tomcat is the Apache web server and MySQL is a database.

However, when it comes to Hadoop, things get a little more complicated. Hadoop comprises a number of tools designed to work together. Because it can be used to accomplish so many different things, people tend to define Hadoop according to the way they happen to use it.

For some people, Hadoop is a data management system: the core of their data analysis, bringing together structured and unstructured data and touching every layer of the traditional enterprise data stack. For others, Hadoop is a massively parallel processing framework with supercomputing power, aimed at accelerating enterprise applications. Still others view Hadoop as an open source community, the main provider of tools and software for solving big data problems. Because Hadoop can be applied to so many problems, many people simply regard it as a basic framework.

Although Hadoop offers all of these capabilities, it is better characterized as an ecosystem: a collection of components for data storage, data integration, data processing, and specialized data analysis.

1.3 Hadoop core components

The Hadoop ecosystem has grown richer over time; Figure 1-1 shows its core components.

Figure 1-1: The core components that make up the Hadoop ecosystem

Starting from the bottom of Figure 1-1, the Hadoop ecosystem consists of the following:

HDFS -- An essential part of the Hadoop ecosystem is the Hadoop Distributed File System (HDFS). HDFS is a mechanism for storing data distributed across a cluster of computers. Data is written once and then read many times. HDFS is the storage foundation for tools such as HBase.
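
To make the write-once, read-many model concrete, here is a minimal Java sketch against the HDFS FileSystem API. It assumes a Hadoop configuration (core-site.xml) is available on the classpath, and the file path is just a placeholder for illustration:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteReadExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hdfs-example.txt"); // placeholder path

        // Write once...
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // ...read many times.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```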

MapReduce -- MapReduce is Hadoop's main execution framework: a programming model for distributed, parallel data processing that splits a job into a map phase and a reduce phase. Developers write MapReduce jobs against data stored in HDFS (for fast access). Because of the way MapReduce works, Hadoop processes the data in parallel, which makes access to large datasets fast.
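
The classic illustration of the two phases is word counting: the map phase emits (word, 1) pairs and the reduce phase sums them per word. The sketch below follows the familiar WordCount pattern; the input and output HDFS paths are passed as command-line arguments and are assumptions for this example:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```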

HBase -- HBase is a column-oriented NoSQL database built on top of HDFS for fast reads and writes of large amounts of data. HBase relies on ZooKeeper for management, ensuring that all of its components stay up and running.
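
As a quick look at that read/write API, here is a minimal sketch using the standard HBase Java client. The "users" table and "info" column family are assumptions for illustration and would have to exist already:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
    public static void main(String[] args) throws Exception {
        // Reads the ZooKeeper quorum from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "row1", column info:name.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read it back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```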

ZooKeeper -- ZooKeeper is Hadoop's distributed coordination service. Many Hadoop components depend on ZooKeeper, which runs across the cluster to manage Hadoop's operation.
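
To give a flavor of the coordination primitives involved, here is a minimal sketch that connects to a ZooKeeper ensemble and creates a node (znode), the basic building block that components watch for configuration and liveness. The connection string and znode path are placeholders:

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to a ZooKeeper ensemble (address is a placeholder).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create a znode and read it back; components use such nodes for
        // shared configuration, leader election, and liveness tracking.
        String path = zk.create("/demo", "hello".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}
```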

Oozie -- Oozie is an extensible workflow coordination system that integrates into the Hadoop stack to coordinate the execution of multiple MapReduce jobs. It can manage complex workflows that are triggered by external events, including the availability of data and the passage of time.
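
Workflows are normally described in a workflow.xml stored in HDFS and submitted to the Oozie server. The sketch below uses the Oozie Java client to submit such a workflow; the server URL, the HDFS application path, and the nameNode/jobTracker properties are assumptions that depend on the particular workflow definition:

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Placeholder Oozie server URL; the workflow.xml at the application
        // path defines the DAG of MapReduce (or other) actions to run.
        OozieClient client = new OozieClient("http://localhost:11000/oozie");

        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:8020/user/alice/wordcount-wf");
        conf.setProperty("nameNode", "hdfs://localhost:8020"); // assumed workflow parameter
        conf.setProperty("jobTracker", "localhost:8032");      // assumed workflow parameter

        // Submit and start the workflow, then check its status.
        String jobId = client.run(conf);
        WorkflowJob job = client.getJobInfo(jobId);
        System.out.println("Workflow " + jobId + " status: " + job.getStatus());
    }
}
```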

Pig -- Pig abstracts away the complexity of MapReduce programming. The Pig platform includes a runtime environment and a scripting language (Pig Latin) for analyzing data sets on Hadoop. Its compiler translates Pig Latin into sequences of MapReduce programs.
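
Pig Latin can also be embedded in Java through PigServer, which makes the abstraction easy to see. In the sketch below, the input file and its two-column schema are assumptions for illustration:

```java
import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Local mode for a quick test; ExecType.MAPREDUCE would run on a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("logs = LOAD 'access_log.txt' AS (user:chararray, bytes:long);");
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("totals = FOREACH by_user GENERATE group, SUM(logs.bytes);");

        // Pig's compiler turns these statements into MapReduce jobs.
        Iterator<Tuple> it = pig.openIterator("totals");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}
```

Running the same statements with ExecType.MAPREDUCE would execute them as MapReduce jobs against data in HDFS.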

Hive -- Hive is a SQL-like high-level language for running queries over data stored on Hadoop. It lets developers unfamiliar with MapReduce write data queries, which are then translated into MapReduce jobs. Like Pig, Hive is an abstraction layer, and as such it appeals to the many data analysts who know SQL rather than Java programming.
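
Queries are typically submitted to Hive through HiveServer2 over JDBC. The sketch below assumes a HiveServer2 endpoint on localhost:10000 and a hypothetical pageviews table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Host, port, database, credentials, and table are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Hive translates this SQL-like query into MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                "SELECT user_id, COUNT(*) AS visits FROM pageviews GROUP BY user_id");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```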

The Hadoop ecosystem also includes the following frameworks for integration with other enterprise systems:

Sqoop -- Sqoop is a connectivity tool for moving data between relational databases, data warehouses, and Hadoop. Sqoop uses the database to describe the schema of the data being imported or exported, and it uses MapReduce for parallelism and fault tolerance.

Flume -- Flume is a distributed, reliable, and efficient service for collecting and aggregating large amounts of data and moving it from individual machines into HDFS. It is based on a simple and flexible architecture and provides streaming data flows. Its simple, extensible data model makes it easy to move data from many machines across the enterprise into Hadoop.
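
Applications can hand log events to a Flume agent through its RPC client SDK; the agent's sink then forwards them on, for example to HDFS. In the sketch below, the agent is assumed to have an Avro source listening on localhost:41414:

```java
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientExample {
    public static void main(String[] args) throws Exception {
        // Host and port of the agent's Avro source are placeholders.
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            for (int i = 0; i < 10; i++) {
                // Each event's body is an opaque byte payload.
                Event event = EventBuilder.withBody("log line " + i, StandardCharsets.UTF_8);
                client.append(event);
            }
        } finally {
            client.close();
        }
    }
}
```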

In addition to the core components shown in Figure 1-1, the Hadoop ecosystem keeps growing to provide new and updated components, such as the following:

Whirr -- Whirr is a set of Java libraries for running cloud services, making it easy for users to run Hadoop clusters on Amazon EC2, Rackspace, and other virtual cloud platforms.

Mahout -- Mahout is a machine learning and data mining library that provides MapReduce implementations of many algorithms, including clustering, regression, and statistical modeling. Because it builds on Apache Hadoop, Mahout scales effectively in the cloud.

BigTop -- BigTop is a framework for packaging and interoperability testing of Hadoop and its related components and sub-projects.

Ambari -- Ambari simplifies Hadoop management by providing support for configuring, managing, and monitoring Hadoop clusters.

The Hadoop family keeps growing. This book mainly deals with three newly incubated Apache Hadoop projects.

From incubator project to Apache project

The following is a brief description of how the Apache Software Foundation operates and how Apache projects relate to one another. Apache's individual members share in the governance of the whole organization, which oversees the creation, maturation, and retirement of Apache projects.

New projects start out in the Incubator. The Apache Incubator was created to help new projects join Apache. Apache provides guidance and review, and after screening, a new project or sub-project is created. Once an incubator project has been created, Apache evaluates its maturity and is responsible for "graduating" it into an Apache project or sub-project. Some incubator projects are also retired, for a variety of reasons.

For a complete list of incubator projects (current, graduated, dormant, and retired), visit http://incubator.apache.org/projects/index.html.

Most Hadoop books today either focus on describing the individual components of the Hadoop ecosystem or on using Hadoop's business analytics tools (such as Pig and Hive). While those topics are important, such books usually lack the in-depth coverage that would help architects build Hadoop-based enterprise applications or other complex applications.

