Chapter 2 - The Hadoop Big Data Processing Framework

Introduction to Hadoop

Hadoop concept

Hadoop, from the Apache Software Foundation, is an open-source distributed computing platform that gives users a distributed infrastructure whose low-level system details remain transparent to them. Hadoop is recognized as the industry-standard open-source software for big data, providing the ability to process huge volumes of data in a distributed environment.

Hadoop is developed in Java and has good cross-platform portability, and it can be deployed on clusters of inexpensive machines. Hadoop applications can be written in a variety of programming languages, such as C/C++, Java, and Python.

The core of Hadoop is the Hadoop Distributed File System (HDFS) together with MapReduce.

Hadoop is a software framework capable of processing large amounts of data in a distributed manner, and it does so in a reliable, efficient, and scalable way. It has the following characteristics:

  • High reliability
  • High efficiency
  • High scalability
  • High fault tolerance
  • Low cost

With these outstanding advantages, Hadoop has been widely applied in many fields, and the Internet industry is its main field of application:

  • Facebook uses its Hadoop platform mainly for log processing, recommendation systems, and data warehousing
  • In China, companies that use Hadoop include Baidu, Alibaba, NetEase, Huawei, and China Mobile; Alibaba runs a comparatively large Hadoop cluster

Figure: Hadoop in an enterprise application architecture

Hadoop version

The evolution of Apache Hadoop versions:

  • Hadoop 1.0
    • Hadoop 0.20.x
    • Hadoop 0.21.x
    • Hadoop 0.22.x
  • Hadoop 2.0
    • Hadoop 0.23.x
    • Hadoop 2.x
  • Hadoop 3.0
    • Hadoop 3.x

The Hadoop 3.x line has not yet been adequately tested and may still have some problems. The Hadoop 2.x line is currently the recommended version.

The core components of Hadoop 1.0 include only MapReduce and HDFS; it has a single NameNode, a single namespace, and inefficient resource management. Hadoop 2.0 introduced HDFS HA, which provides a hot-standby mechanism for the NameNode; HDFS Federation, which manages multiple namespaces; and a new resource-management framework, YARN.

Hadoop Optimization and Development

The optimization and development of Hadoop are mainly reflected in two aspects:

  • Improvements to the architecture of Hadoop's own two core components, MapReduce and HDFS
  • Continuous enrichment of the Hadoop ecosystem with additional components such as Pig, Tez, and Spark (see the table below)

Component | Features | Problem solved
Pig | Scripting language for large-scale data processing; users only need to write a few simple statements, which the system automatically converts into MapReduce jobs | The abstraction level of MapReduce is low, requiring a lot of code to be written by hand
Spark | Memory-based distributed parallel programming framework with good real-time performance and better support for iterative computation | MapReduce has high latency and is ill-suited to iterative computation
Oozie | Workflow engine and coordination service that coordinates the different tasks running on Hadoop | Hadoop provides no mechanism for managing dependencies between jobs, so users must handle job dependencies themselves
Tez | Computing framework that supports DAG jobs; a job's operations are decomposed and recombined to form one large DAG job, reducing unnecessary operations | Duplicate MapReduce operations across different jobs reduce efficiency
Kafka | Distributed publish-subscribe messaging system; different kinds of distributed systems can all connect to Kafka, enabling efficient real-time exchange of different types of data between Hadoop components and other systems | The components of the Hadoop ecosystem and other products lack a unified, efficient intermediary for data exchange

Hadoop ecosystem

As the Hadoop project has developed, its structure has been continuously enriched, forming a rich Hadoop ecosystem.

A data warehouse is a subject-oriented, integrated, relatively stable collection of data that reflects historical change, used to support management decision-making.

The main differences between a data warehouse and a database are:

  1. A database is designed around transactions, whereas a data warehouse is designed around subjects
  2. A database generally stores online transactional data, whereas a data warehouse generally stores historical data
  3. Database design avoids redundancy, whereas data warehouse design deliberately introduces redundancy
  4. A database is designed for capturing data, whereas a data warehouse is designed for analyzing data

Hive

Hive is a data warehouse tool built on top of Hadoop. It supports large-scale data storage and analysis and has good scalability. To some extent it can be regarded as a user programming interface: Hive itself does not store or process data.

  • Relies on the distributed file system HDFS to store data
  • Relies on the distributed parallel computing model MapReduce to process data
  • Defines a simple SQL-like query language, HiveQL; users write HiveQL statements, which are run as MapReduce tasks (a JDBC sketch follows this list)
  • Makes it easy to port data warehouse applications originally built on relational databases to the Hadoop platform
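
As an illustration of the third point above, here is a minimal Java sketch of submitting a HiveQL statement through the Hive JDBC driver. It assumes a HiveServer2 instance reachable on the default port 10000; the host, database, table, and column names are illustrative assumptions, not examples from this chapter.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlSketch {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver (hive-jdbc must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Illustrative connection string: host, port, and database are assumptions.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             // Hive translates this SQL-like statement into MapReduce tasks.
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```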

Hive itself provides a series of extract, transform, load (ETL) tools for storing, querying, and analyzing large-scale data stored in Hadoop; these tools can satisfy a variety of data warehouse scenarios.

Hive is similar to a traditional relational database in many respects, but because it relies on HDFS and MapReduce underneath, it also differs from traditional databases in many ways.

Comparison | Hive | Traditional database
Data insertion | Supports batch import | Supports both single-record and batch import
Data update | Not supported | Supported
Indexes | Supported | Supported
Execution latency | High | Low
Scalability | Good | Limited

Hive application example: WordCount

  • Implementing the WordCount algorithm with Hive requires much less code to be written
  • With a MapReduce implementation, the algorithm must be compiled into a jar file before it can be executed; with Hive this is not necessary. The HiveQL statement is ultimately still converted into MapReduce tasks for execution, but the Hive framework does this automatically, so users do not need to understand the implementation details (a Java MapReduce sketch follows for contrast)
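
For contrast, the sketch below shows the classic Java MapReduce WordCount program that Hive spares the user from writing. It follows the standard Hadoop 2.x MapReduce API (org.apache.hadoop.mapreduce), with input and output HDFS paths taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts collected for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

In Hive, the same counting logic typically reduces to a single SELECT ... COUNT(*) ... GROUP BY statement, which the framework turns into equivalent MapReduce tasks automatically.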

Hive is used in enterprise big data analytics platforms, where Pig can sometimes serve as an alternative to Hive.

Pig

Pig is a Hadoop-based platform for large-scale data analysis that simplifies common Hadoop tasks. Pig can load data, express transformations over the data, and store the final results, so that its built-in operations make semi-structured data (such as log files) meaningful. Pig is also extensible: custom data types can be used and support for data conversions can be added in Java.

  • Provides a SQL-like query language, Pig Latin
  • Allows users to perform complex data analysis with simple scripts, without writing complex MapReduce applications
  • Automatically converts user-written scripts into MapReduce jobs that run on the Hadoop cluster, and automatically optimizes the generated MapReduce programs
  • Hive generally processes structured data, while Pig can also handle unstructured data; the processing flow is LOAD -> transform -> STORE/DUMP (a sketch using Pig's embedded Java API follows this list)
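
As an illustration of the LOAD -> transform -> STORE flow mentioned above, here is a minimal sketch that drives Pig from Java through its embedded PigServer API. The input path, schema, and filter condition are assumptions chosen for illustration, not examples from this chapter.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFlowSketch {
    public static void main(String[] args) throws Exception {
        // Run the Pig Latin statements on the cluster's MapReduce engine.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // LOAD: read semi-structured log lines (path and schema are illustrative).
        pig.registerQuery("raw = LOAD 'input/access.log' AS (line:chararray);");

        // Transform: keep only the lines that mention an error.
        pig.registerQuery("errors = FILTER raw BY line MATCHES '.*ERROR.*';");

        // STORE: write the transformed relation back to HDFS.
        pig.store("errors", "output/errors");
    }
}
```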

Tez: a computing framework that supports DAG jobs. Its core idea is to split the Map and Reduce operations further, so that the resulting elementary operations can be freely decomposed and recombined into new operations; after assembly under program control, these can form one large DAG job.

Important components of Hadoop

The important components of the Hadoop ecosystem are listed below.

Component | Features
HDFS | Distributed file system
MapReduce | Distributed parallel programming model
YARN | Resource manager and scheduler
Tez | Next-generation query processing framework that runs on top of YARN
Hive | Data warehouse on Hadoop
HBase | Non-relational distributed database on Hadoop
Pig | Hadoop-based platform for large-scale data analysis that provides the SQL-like query language Pig Latin
Sqoop | Data transfer between Hadoop and traditional databases
Oozie | Workflow management system on Hadoop
Zookeeper | Provides distributed coordination services
Storm | Stream computing framework
Flume | Highly available, highly reliable, distributed system for massive log collection, aggregation, and transmission
Ambari | Rapid Hadoop deployment tool that supports provisioning, management, and monitoring of Apache Hadoop clusters
Kafka | High-throughput distributed publish-subscribe messaging system that can handle all the action-stream data of a consumer-scale website
Spark | General-purpose parallel computing framework analogous to Hadoop MapReduce

Hadoop cluster deployment

Hadoop installation modes:

  • Stand-alone (local) mode: Hadoop's default, non-distributed mode; it can run without any additional configuration.
  • Pseudo-distributed mode: Hadoop can run on a single node in a pseudo-distributed way, with the Hadoop daemons running as separate Java processes; the node acts as both NameNode and DataNode, and files are read from HDFS (a client-side sketch follows this list).
  • Fully distributed mode: Hadoop runs on a cluster made up of multiple nodes.
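
The following is a minimal Java sketch of a client talking to such a pseudo-distributed, single-node HDFS. It assumes that core-site.xml points fs.defaultFS at hdfs://localhost:9000, which is a conventional but not universal choice; in pseudo-distributed setups dfs.replication is also typically set to 1.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PseudoDistributedCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // In a pseudo-distributed setup, core-site.xml usually points fs.defaultFS
        // at the local single-node HDFS; the address below is an assumption.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);
        // List the HDFS root directory to confirm that the single node is serving
        // both the NameNode and DataNode roles.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```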

The most central designs in the Hadoop framework are HDFS, which provides storage for massive amounts of data, and MapReduce, which performs computation on that data. A MapReduce job mainly involves:

  1. Reading data from disk or over the network, i.e. I/O-intensive work
  2. Computing on the data, i.e. CPU-intensive work

The overall performance of a Hadoop cluster depends on the balance among CPU, memory, network, and storage. When choosing machine configurations, the operations team should therefore select hardware suited to each type of worker node.

The main node types in a basic Hadoop cluster are:

  • NameNode: coordinates data storage across the cluster
  • DataNode: stores the data blocks into which files are split
  • JobTracker: coordinates data computation tasks across machines
  • TaskTracker: executes the tasks assigned by the JobTracker
  • SecondaryNameNode: helps the NameNode collect state information about the running file system

In a cluster, most machines work as DataNodes and TaskTrackers. The NameNode provides all of the services of the HDFS file system, such as namespace management and block management; much of its metadata is kept directly in memory, so it needs more RAM, in proportion to the number of data blocks in the cluster, and the memory-channel bandwidth of its RAM should be optimized. In a small cluster the SecondaryNameNode can share a machine with the NameNode; larger clusters can give it the same class of hardware as the NameNode.
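
To make the memory requirement concrete: a commonly cited rule of thumb (an assumption here, not a figure from this chapter) is that each file, directory, and block object held in NameNode memory occupies on the order of 150 bytes of heap. Under that assumption, a cluster holding about 100 million blocks would need roughly 100,000,000 × 150 bytes ≈ 15 GB of NameNode heap for block metadata alone, which is why NameNode RAM must grow with the number of blocks in the cluster.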

A Hadoop cluster can be large or small. Initially, it can start at a fairly small scale, for example 10 nodes, and then grow as storage and computing needs expand.

For a small cluster, it is generally acceptable to run the NameNode and the JobTracker on a single node. However, as the number of files stored in HDFS grows, the NameNode needs more main memory, and at that point the NameNode and the JobTracker should run on different nodes.

The SecondaryNameNode can run on the same machine as the NameNode; however, since the two have almost the same main-memory requirements, it is preferable to run them on two different nodes.

A common Hadoop cluster consists of a two-level network. Each rack holds 30-40 servers and is equipped with a 1-gigabit switch, which uplinks to a core switch or router (1 gigabit or above).
