Chapter 2 - Hadoop Big Data Processing Architecture
Introduction to Hadoop
Hadoop concept
Hadoop, a project of the Apache Software Foundation, is an open-source distributed computing platform that provides users with a distributed infrastructure whose system-level details are transparent. Hadoop is recognized as the industry-standard open-source software for big data, providing the ability to process huge amounts of data in a distributed environment.
Hadoop is developed in Java and is therefore highly cross-platform; it can be deployed on inexpensive computer clusters. Hadoop supports a variety of programming languages, such as C/C++, Java, and Python.
The core of Hadoop is the Hadoop Distributed File System (HDFS) and MapReduce.
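To make the MapReduce half of that pair concrete, the division of labor between a map phase and a reduce phase can be sketched in plain Python. This is a local simulation of the programming model only, not the Hadoop API; the word-count task and all function names are illustrative:

```python
from collections import defaultdict

# Map phase: emit an intermediate (word, 1) pair for every word in a line.
def mapper(line):
    for word in line.split():
        yield (word, 1)

# Reduce phase: combine all values collected for one key.
def reducer(word, counts):
    return (word, sum(counts))

def mapreduce(lines):
    # Shuffle: group intermediate values by key, as the Hadoop
    # framework does automatically between the map and reduce phases.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

print(mapreduce(["big data", "big cluster"]))  # {'big': 2, 'data': 1, 'cluster': 1}
```

On a real cluster, the map calls run in parallel near the data blocks on HDFS, and the shuffle moves intermediate pairs across the network to the reducers.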
Hadoop is a software framework capable of processing large amounts of data in a distributed, reliable, efficient, and scalable way. It has the following characteristics:
- High reliability
- High efficiency
- High scalability
- High fault tolerance
- Low cost
With these outstanding advantages, Hadoop has been widely adopted across many fields, the Internet industry being its main area of application:
- Facebook uses its Hadoop platform mainly for log processing, recommendation systems, and data warehousing
- In China, companies using Hadoop include Baidu, Alibaba, NetEase, Huawei, and China Mobile, among which Alibaba's Hadoop cluster is relatively large
Hadoop in an enterprise application architecture:
Hadoop versions
Evolution of Apache Hadoop versions:
- Hadoop 1.0
  - Hadoop 0.20.x
  - Hadoop 0.21.x
  - Hadoop 0.22.x
- Hadoop 2.0
  - Hadoop 0.23.x
  - Hadoop 2.x
- Hadoop 3.0
  - Hadoop 3.x
Hadoop 3.x has not yet been adequately tested and may still have issues; at present, Hadoop 2.x is the recommended version.
Hadoop 1.0's core components include only MapReduce and HDFS; it has a single name node and a single namespace, and its resource management is inefficient. Hadoop 2.0 introduced HDFS HA, which provides a hot-standby mechanism for the name node; HDFS Federation, which manages multiple namespaces; and a new resource management framework, YARN.
Hadoop Optimization and Development
The optimization and development of Hadoop is mainly reflected in two aspects:
- Improvements to the architectural design of Hadoop's own two core components, MapReduce and HDFS
- The continued enrichment of the Hadoop ecosystem with other components, such as the newly added Pig, Tez, and Spark
Component | Function | Problem solved |
---|---|---|
Pig | A scripting language for large-scale data processing; users write only a few simple statements, which the system automatically converts into MapReduce jobs | MapReduce's low level of abstraction, which requires a lot of code to be written by hand |
Spark | A memory-based distributed parallel programming framework with high real-time performance and better support for iterative computation | MapReduce's high latency and unsuitability for iterative computation |
Oozie | A workflow engine and coordination service that coordinates different tasks running on Hadoop | Hadoop provides no mechanism for managing dependencies between jobs, so users must handle inter-job dependencies themselves |
Tez | A computing framework supporting DAG jobs; job operations are decomposed and recombined to form one large DAG job, reducing unnecessary operations | Duplicated MapReduce operations between different tasks, which reduce efficiency |
Kafka | A distributed publish-subscribe messaging system; different types of distributed systems can connect to Kafka uniformly, enabling efficient real-time exchange of different types of data between the various Hadoop components | The lack of a unified, efficient data-exchange intermediary among the components of the Hadoop ecosystem and other products |
Hadoop ecosystem
As the Hadoop project has developed, its structure has been continually enriched, forming a rich Hadoop ecosystem.
A data warehouse is a subject-oriented, integrated, relatively stable collection of data that reflects historical changes and is used to support management decision-making.
The main differences between a data warehouse and a database are:
- A database is designed around transactions, while a data warehouse is designed around subjects
- A database generally stores online transaction data, while a data warehouse generally stores historical data
- Database design avoids redundancy, while data warehouse design intentionally introduces redundancy
- A database is designed for capturing data, while a data warehouse is designed for analyzing data
Hive
Hive is a data warehouse tool built on top of Hadoop. It supports large-scale data storage and analysis and has good scalability. To some extent it can be seen as a user programming interface: Hive itself does not store or process data.
- It relies on the distributed file system HDFS for data storage
- It relies on the distributed parallel computing model MapReduce for data processing
- It defines a simple SQL-like query language, HiveQL; users can write HiveQL statements to run MapReduce tasks
- It makes it easy to port data warehouse applications originally built on relational databases to the Hadoop platform
Hive itself provides a series of extract, transform, load (ETL) tools that can store, query, and analyze large-scale data stored in Hadoop; these tools can satisfy a variety of data warehouse scenarios.
Hive is similar to a traditional relational database in many respects, but because it relies on HDFS and MapReduce underneath, it also differs from a traditional database in many ways.
Comparison | Hive | Traditional database |
---|---|---|
Data insertion | Supports batch import | Supports single-record and batch import |
Data update | Not supported | Supported |
Indexes | Supported | Supported |
Execution latency | High | Low |
Scalability | Good | Limited |
Hive application example: WordCount
- Implementing the WordCount algorithm with Hive requires writing much less code
- When implementing WordCount in MapReduce, the algorithm must be compiled into a jar file before it can be executed; with Hive this is unnecessary. The HiveQL statement is ultimately converted into MapReduce tasks for execution, but this is done automatically by the Hive framework, and users do not need to understand the implementation details
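As an illustration, a WordCount in HiveQL might look like the following sketch. The table name `docs`, column name `line`, and input path are hypothetical; Hive compiles the final statement into MapReduce tasks behind the scenes:

```sql
-- Hypothetical input table: one line of text per row.
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'input' OVERWRITE INTO TABLE docs;

-- Split each line into words, then group and count;
-- Hive converts this statement into MapReduce jobs automatically.
CREATE TABLE word_count AS
  SELECT word, count(1) AS count
  FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
  GROUP BY word
  ORDER BY word;
```

Compare this handful of statements with the mapper, reducer, and driver classes a hand-written MapReduce WordCount requires.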
Hive is used in enterprise big data analytics platforms; in some cases, Pig can be used as an alternative to Hive.
Pig
Pig is a large-scale Hadoop-based data analysis platform that simplifies common Hadoop tasks. Pig can load data, express transformations on the data, and store the final results, making Pig's built-in operations meaningful for semi-structured data (such as log files). Pig is also extensible: custom data types written in Java can be used, and support for new data transformations can be added.
- Pig provides a SQL-like query language, Pig Latin
- It allows users to perform complex data analysis with simple scripts, without writing complex MapReduce applications
- Pig automatically converts user-written scripts into MapReduce jobs that run on a Hadoop cluster, and it automatically optimizes the generated MapReduce programs
- Hive generally processes structured data, while Pig can also process unstructured data. Processing flow: LOAD -> transform -> STORE/DUMP
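The LOAD -> transform -> STORE flow above can be sketched in Pig Latin. The file paths and field names here are hypothetical; Pig would translate this script into MapReduce jobs:

```pig
-- LOAD: read raw lines from HDFS into a relation.
lines  = LOAD 'input/log.txt' AS (line:chararray);
-- Transform: tokenize each line into words and count occurrences.
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
groups = GROUP words BY word;
counts = FOREACH groups GENERATE group AS word, COUNT(words) AS n;
-- STORE writes the result back to HDFS (DUMP would print it instead).
STORE counts INTO 'output/wordcount';
```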
Tez: a computing framework supporting DAG jobs. Its core idea is to split the Map and Reduce operations further into elementary operations that can be freely decomposed and recombined to produce new operations, which, assembled under program control, can form one large DAG job.
Important components of Hadoop
The important components of the Hadoop ecosystem:
Component | Function |
---|---|
HDFS | Distributed file system |
MapReduce | Distributed parallel programming model |
YARN | Resource manager and scheduler |
Tez | A next-generation query processing framework running on top of YARN |
Hive | A data warehouse on Hadoop |
HBase | A non-relational distributed database on Hadoop |
Pig | A Hadoop-based platform for large-scale data analysis that provides the SQL-like query language Pig Latin |
Sqoop | Used to transfer data between Hadoop and traditional databases |
Oozie | A workflow management system on Hadoop |
Zookeeper | Provides distributed coordination services |
Storm | A stream computing framework |
Flume | A highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data |
Ambari | A rapid Hadoop deployment tool that supports provisioning, management, and monitoring of Apache Hadoop clusters |
Kafka | A high-throughput distributed publish-subscribe messaging system that can handle all the action-stream data of a consumer-scale website |
Spark | A general-purpose parallel framework analogous to Hadoop MapReduce |
Hadoop cluster deployment
Hadoop installation modes:
- Stand-alone mode: Hadoop's default mode (local, non-distributed); it runs without any additional configuration.
- Pseudo-distributed mode: Hadoop runs on a single node in a pseudo-distributed manner, with the Hadoop daemons running as separate Java processes. The node acts as both NameNode and DataNode, and files are read from HDFS.
- Distributed mode: Hadoop runs on a cluster built from multiple nodes.
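For reference, pseudo-distributed mode is typically enabled with minimal configuration along the following lines, as in the Apache Hadoop single-node setup guide; the host, port, and file locations are the conventional defaults and should be adjusted for your environment:

```xml
<!-- etc/hadoop/core-site.xml: point the default file system
     at an HDFS daemon on the local machine -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml: with a single DataNode,
     keep only one replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```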
The core of the Hadoop framework's design is HDFS, which provides storage for massive data, and MapReduce, which performs computation on that data. A MapReduce job mainly involves:
- Reading data from disk or over the network, i.e. IO-intensive work
- Computing on the data, i.e. CPU-intensive work
The overall performance of a Hadoop cluster depends on the balance among CPU, memory, network, and storage. When selecting machine configurations, the operations team should therefore choose hardware suited to each type of worker node.
The main nodes in a basic Hadoop cluster are:
- NameNode: coordinates data storage in the cluster
- DataNode: stores the split data blocks
- JobTracker: coordinates data computation tasks across machines
- TaskTracker: executes the tasks assigned by the JobTracker
- SecondaryNameNode: helps the NameNode collect status information about the running file system
In a cluster, most machines work as DataNodes and TaskTrackers. The NameNode provides all services for the HDFS file system, such as namespace management and block management. Much of its metadata is kept directly in memory, so it needs more RAM, in proportion to the number of data blocks in the cluster, and its RAM memory-channel bandwidth should be optimized. In a small cluster, the SecondaryNameNode can share a machine with the NameNode; larger clusters can give it the same class of hardware as the NameNode.
A Hadoop cluster can be large or small. You can start with a relatively small cluster, say 10 nodes, and then grow it as storage and computing demands expand.
For a small cluster, running the NameNode and the JobTracker on a single node is generally acceptable. However, as the number of files stored in HDFS grows, the NameNode needs more main memory, at which point the NameNode and the JobTracker should run on different nodes.
The SecondaryNameNode can run on the same machine as the NameNode; however, since the two have almost the same main-memory requirements, it is better to run them on two different nodes.
A common Hadoop cluster structure consists of a two-level network. Each rack holds 30-40 servers and is equipped with a 1 Gb switch, which uplinks to a core switch or router (1 Gb or faster).