HDFS (Hadoop Distributed File System) component architecture overview

1. Differences between Hadoop 1.x and Hadoop 2.x

2. Components Introduction

HDFS Architecture Overview
1) NameNode (NN):
  stores file metadata, such as the file name, directory structure, and file attributes (creation time, number of replicas, file permissions), as well as the block list of each file and the DataNodes on which each block is located.
2) DataNode (DN):
  stores file block data, together with the checksums of that block data, in its local file system.
3) SecondaryNameNode (2NN):
  an auxiliary daemon that monitors the state of HDFS and takes a snapshot of the HDFS metadata at regular intervals.
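
The split of responsibilities above can be sketched as a toy in-memory model. This is only an illustration, not how HDFS is implemented: the class names `ToyNameNode` and `ToyDataNode`, the helpers `put` and `get`, the 4-byte block size, the round-robin replica placement, and the use of `zlib.crc32` as the checksum are all assumptions made for the example.

```python
import zlib
from dataclasses import dataclass, field

BLOCK_SIZE = 4     # bytes; tiny on purpose so the example splits a short file
REPLICATION = 2    # number of copies kept of each block


@dataclass
class ToyDataNode:
    name: str
    blocks: dict = field(default_factory=dict)  # block_id -> (data, checksum)

    def store(self, block_id, data):
        # Keep the block data together with its checksum.
        self.blocks[block_id] = (data, zlib.crc32(data))

    def read(self, block_id):
        data, checksum = self.blocks[block_id]
        if zlib.crc32(data) != checksum:
            raise IOError(f"block {block_id} is corrupt on {self.name}")
        return data


@dataclass
class ToyNameNode:
    # Metadata only: the NameNode never holds file contents.
    files: dict = field(default_factory=dict)      # path -> [block_id, ...]
    locations: dict = field(default_factory=dict)  # block_id -> [DataNode name, ...]


def put(namenode, datanodes, path, content):
    """Client-side write: split into blocks, store replicas, record metadata."""
    namenode.files[path] = []
    for index, start in enumerate(range(0, len(content), BLOCK_SIZE)):
        block_id = f"{path}#{index}"
        # Round-robin placement (an assumption made for this sketch).
        targets = [datanodes[(index + r) % len(datanodes)] for r in range(REPLICATION)]
        for dn in targets:
            dn.store(block_id, content[start:start + BLOCK_SIZE])
        namenode.files[path].append(block_id)
        namenode.locations[block_id] = [dn.name for dn in targets]


def get(namenode, datanodes, path):
    """Client-side read: look up block locations, then fetch from the first replica."""
    by_name = {dn.name: dn for dn in datanodes}
    return b"".join(
        by_name[namenode.locations[block_id][0]].read(block_id)
        for block_id in namenode.files[path]
    )
```

For example, after `put(nn, dns, "/a.txt", b"hello hdfs")` with three DataNodes, `nn.files["/a.txt"]` lists three block IDs and `get(nn, dns, "/a.txt")` reassembles the original bytes.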

YARN Architecture Overview

1) ResourceManager (RM):
  handles client requests
  monitors the NodeManagers
  starts and monitors the ApplicationMaster
  allocates and schedules resources
2) NodeManager (NM):
  manages the resources on a single node
  processes commands from the ResourceManager
  processes commands from the ApplicationMaster
3) ApplicationMaster (AM):
  responsible for splitting the input data
  requests resources for the application and assigns them to its internal tasks
  monitors tasks and handles fault tolerance
4) Container:
  A Container is YARN's resource abstraction; it encapsulates the multi-dimensional resources on a node, such as memory, CPU, disk, and network.
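
How these four roles interact can be sketched with a small toy model. Everything here is hypothetical and simplified: the names `ToyResourceManager`, `ToyNodeManager`, and `run_application` are invented for the example, only memory and CPU are tracked, and the first-fit placement is an assumption (real YARN schedulers are considerably more involved).

```python
from dataclasses import dataclass


@dataclass
class Container:
    # A bundle of resources on one node.
    node: str
    memory_mb: int
    vcores: int


@dataclass
class ToyNodeManager:
    name: str
    free_memory_mb: int
    free_vcores: int

    def launch(self, memory_mb, vcores):
        """Reserve resources on this node, or return None if they do not fit."""
        if memory_mb > self.free_memory_mb or vcores > self.free_vcores:
            return None
        self.free_memory_mb -= memory_mb
        self.free_vcores -= vcores
        return Container(self.name, memory_mb, vcores)


class ToyResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, memory_mb, vcores):
        """First-fit placement: use the first node with enough free resources."""
        for nm in self.node_managers:
            container = nm.launch(memory_mb, vcores)
            if container is not None:
                return container
        return None  # cluster is full; a real scheduler would queue the request


def run_application(rm, task_count, memory_mb=1024, vcores=1):
    """Play the ApplicationMaster: request one container per task."""
    return [rm.allocate(memory_mb, vcores) for _ in range(task_count)]
```

With one node offering 2048 MB and another offering 1024 MB, asking for four 1024 MB containers yields two on the first node, one on the second, and `None` for the fourth request.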

MapReduce Architecture Overview
MapReduce divides a computation into two stages: Map and Reduce.
  1) The Map stage processes the input data in parallel.
  2) The Reduce stage aggregates the results of the Map stage.
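
The two stages can be illustrated with a word count written in plain Python. This is a single-machine sketch of the programming model, not the Hadoop MapReduce API: the function names are invented for the example, a thread pool stands in for parallel map tasks, and the `shuffle` step (grouping values by key between the two stages) is added because Reduce needs grouped input.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(line):
    """Map: turn one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(mapped):
    """Group all values that share a key, as happens between Map and Reduce."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    # Each line is mapped independently, so the map calls can run in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, lines))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())
```

Calling `word_count(["Hadoop stores data", "Hadoop processes data"])` counts `hadoop` and `data` twice each and the other words once.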

3. Big Data Technology Ecosystem

 

The technical terms that appear in the figure are explained as follows:
1) Sqoop: Sqoop is an open-source tool used mainly to transfer data between Hadoop, Hive, and traditional databases (such as MySQL). It can import data from a relational database (e.g. MySQL, Oracle) into HDFS, and it can also export data from HDFS into a relational database.
2) Flume: Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting massive volumes of log data. Flume supports customizing the various data senders in a logging system in order to collect data; it can also do simple processing on the data and write it to various (customizable) data receivers.
3) Kafka: Kafka is a high-throughput distributed publish-subscribe messaging system with the following characteristics:
(1) It persists messages using an O(1) disk data structure, which keeps performance stable over long periods even with terabytes of stored messages.
(2) High throughput: even on very ordinary hardware, Kafka can support millions of messages per second.
(3) It supports partitioning messages across Kafka servers and across clusters of consumer machines.
(4) It supports parallel data loading into Hadoop.
4) Storm: Storm is used for "continuous computation": it runs continuous queries over data streams and outputs the results to users as a stream while the computation is running.
5) Spark: Spark is one of the most popular open-source in-memory computing frameworks for big data. It can run computations on big data stored in Hadoop.
6) Oozie: Oozie is a workflow scheduling and management system for managing Hadoop jobs.
7) HBase: HBase is a distributed, column-oriented open-source database. Unlike a typical relational database, HBase is suited to storing unstructured data.
8) Hive: Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides simple SQL query capabilities by converting SQL statements into MapReduce jobs. Its advantage is a low learning cost: simple MapReduce statistics can be implemented quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse.
10) R Language: R is a language and operating environment for statistical analysis and graphics. R is free, open-source software that belongs to the GNU system, and it is an excellent tool for statistical computing and statistical graphics.
11) Mahout: Apache Mahout is a scalable machine learning and data mining library.
12) ZooKeeper: ZooKeeper is an open-source implementation of Google's Chubby. It is a reliable coordination system for large distributed systems, providing features such as configuration maintenance, naming services, distributed synchronization, and group services. ZooKeeper's goal is to encapsulate complex, error-prone critical services and to give users an easy-to-use interface and a system that is efficient and functionally stable.
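
As one concrete illustration from the list, the message partitioning mentioned under Kafka can be sketched as follows. This is a simplified model rather than Kafka's own code: the function names are hypothetical, `zlib.crc32` stands in for whatever hash a real client uses, and the round-robin assignment of partitions to consumers is an assumption made for the example.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a message key to a partition; the same key always lands in the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, consumers):
    """Spread the partitions of a topic over the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment
```

Because every message with a given key goes to one partition, and each partition is read by one consumer in the group, the work is split across machines while the order of messages for a single key is preserved.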

4. Recommendation System Project Architecture


Origin www.cnblogs.com/linyouyi/p/11456685.html