2019/08/09 Hadoop basics (02)

Map and reduce exchange data: the intermediate results produced by the map phase are later sent to the reducers. When a reducer starts can be configured; after a job is submitted there is no strict timing rule for starting mappers and reducers. In general a reducer can start once the mapper tasks finish, and the mappers then send their intermediate results to it; if the reducers start late, the mapper output may have to be saved locally first.

While processing, a mapper must send intermediate results with the same key to the same reducer. Since keys are scattered across servers, the MapReduce framework has to decide which key goes to which reducer; this is mostly determined by the programmer, and how many reducers to start can also be set by the programmer through the MapReduce framework.

Hadoop enables parallel task processing. Which tasks run and when depends first on what data you have, and second on a MapReduce (Hadoop) developer writing the program. The MapReduce program is submitted to the JobTracker, which starts executing it, including the shuffle and sort phase. What the mapper and reducer tasks actually do is defined by the programmer; that is what Hadoop developers are for, and what a map task does depends on the program your programmers develop.

Each piece of input data is first turned into a key-value pair and then processed by the mapper. How to handle the key and value is code the programmer writes before writing the mapper, i.e. how to extract key-value pairs from the raw data. The reducer may run on other nodes; which node it runs on, and which reducer each key-value pair is sent to, also has to be defined by the programmer.
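To make this concrete, here is a minimal, hypothetical word-count style Mapper sketch (the class name, key/value types, and word-count logic are illustrative choices, not something from these notes): it shows the programmer-written code that turns a line of raw input into intermediate key-value pairs.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The input key is the byte offset of the line; the value is the line itself.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            // Emit an intermediate (word, 1) pair for the shuffle/sort phase.
            context.write(word, ONE);
        }
    }
}
```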
This is done by defining a partitioner on each mapper: the partitioner decides where each generated key-value result is sent. For example, key ik1 goes to the first reducer and ik3 goes to the third. The partitioner is responsible for routing each key to the corresponding reducer through the shuffle and sort process, and the partitioner is written by the programmer.
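A minimal sketch of such a programmer-written partitioner follows (the routing rule, by the key's first character, and the class name are purely illustrative; if you write nothing, Hadoop falls back to a hash-based partitioner):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Route keys to reducers by their first character, so all identical keys
        // (and keys sharing a first letter) land on the same reducer.
        if (key.getLength() == 0) {
            return 0;
        }
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}
```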
The mapper reads key-value pairs and generates key-value pairs. These results are sent to the reducer and folded again to produce a single result per key. A mapper may emit more than one key, and all identical keys from the mappers must be passed to the same reducer; the program is responsible for this via the partitioner (which sends a given key to a given reducer). For example, all ik1 pairs are merged at the first reducer; otherwise they would remain scattered.

A combiner can fold values locally on the map side before they are sent, for example combining ik2's values 3 + 6 into 9. The combiner is also written by the MapReduce programmer. The mapper's input key and output key can differ, and so can the reducer's, but although a combiner can pre-process data, its input and output keys must be the same, so it can only implement a simple merge function. The partitioner then sends each key to its designated reducer.
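To make the reduce-side "fold" concrete, here is a minimal Reducer sketch in the same hypothetical word-count vein (names are illustrative): all values arriving for one key are summed into a single result. Because its input and output key/value types are identical, the same class could also act as the simple merge-only combiner described above.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();   // fold all counts for this key into one total
        }
        result.set(sum);
        context.write(key, result);
    }
}
```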
If there are multiple reducers, there are multiple output streams, and each reducer's output is only part of the overall result.
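A driver sketch, reusing the hypothetical WordCountMapper, WordCountReducer, and FirstLetterPartitioner classes from the sketches above, shows where the programmer wires up the mapper, combiner, partitioner, reducer, and the number of reducers (and therefore the number of output files):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);    // local "fold" on the map side
        job.setPartitionerClass(FirstLetterPartitioner.class);
        job.setReducerClass(WordCountReducer.class);
        job.setNumReduceTasks(3);                        // the programmer picks the reducer count

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```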

The big data itself is stored on different HDFS nodes, and the tasks started by three mappers run on 3 splits; anything not local has to be copied. If the first node has a split and the second node also has a split, but the mapper tasks are started only on the first node, the data split on the other node has to be copied to that node.
The mapper reads KV pairs and outputs KV pairs. During shuffle and sort, each mapper first completes the sort locally; once sorting finishes, copying starts, and the intermediate data is copied to the reducers. All records with the same key go to the same reducer, where the streams from the mappers are merged and folded by the reducer.

Each reducer's results, covering a portion of the data, are placed on HDFS; once stored on HDFS, each output gets its own blocks and replicas.

In Hadoop 1, the request submitted by the MapReduce client is called a job, and the Hadoop MapReduce master is called the JobTracker. A map is a Java process that runs in a JVM, and so is a reduce. Map instances read the input data and generate the intermediate keys; reduce instances receive the KV pairs from the maps, fold them, and store the generated output data back to HDFS. The client's job is split into two parts, map and reduce. How many maps start and which nodes they run on is decided by the JobTracker; how many reduces start is decided by the programmer, though it can also be decided by the JobTracker. The Hadoop MapReduce master also tracks job history.
The JobTracker tracks active jobs and historical jobs. An active job is divided into map tasks and reduce tasks. The number of tasks any TaskTracker can run is limited (say, 4 task slots); once the slots are full, additional submitted tasks cannot run. The JobTracker plays two roles, distributing jobs and managing resources, so it can easily become a performance bottleneck.

On to Hadoop 2:
MR = MapReduce
MRv1 (Hadoop 1) -> MRv2 (Hadoop 2)
MRv1: cluster resource management + data processing
(MapReduce ran the programs: Pig, Hive, and others)
MRv2:
YARN: cluster resource manager
MRv2: data processing (only the data-processing function is retained)
MR: implements batch processing only (it becomes a framework used in the same way as Pig and Hive)
Tez: execution engine
RT / Stream / Graph: real-time, streaming, and graph processing
From version 1 to version 2, MapReduce was split into multiple functions; the part that was split out became a common framework.

RM: Resource Manager
NM: Node Manager

AM: Application Master
container: mr task;
MapReduce can run on top of Tez (the execution engine) or without it; Spark offers high-performance in-memory computing, and OpenMPI offers high-performance computing.

How does a Hadoop job run under YARN?

RM: ResourceManager. For the applications that are started, each application has its own AM, which is responsible for managing the application's internal tasks.
NM: NodeManager, manages the current node and reports the node's status and current information.

AM: ApplicationMaster, the master of an application; it decides how many mappers and how many reducers to start.
Containers: MR tasks all run inside containers (with cgroups, the amount of resources available to the namespace can be limited). Each container may run a map or a reduce, called the task in the container. The container's execution status is continuously reported to the ApplicationMaster, and the number of containers to start is decided by the ApplicationMaster.

The work previously done by the JobTracker and TaskTrackers is split into multiple parts: job management and resource allocation (the ResourceManager) are handled by two different components.
NM: the NodeManager periodically reports node-related information to the ResourceManager.
Resource management and running the application's tasks are split into two jobs: the application is run by the ApplicationMaster, and resources are managed by the ResourceManager. The ResourceManager asks whether a node has a free container to run the program and finds a node for the application's ApplicationMaster; how the rest runs is allocated by the ApplicationMaster. When a container needs to be started on some node, the ApplicationMaster applies to the ResourceManager; the ResourceManager finds the node, allocates the requested container, and tells the ApplicationMaster after the allocation, and the ApplicationMaster can then use this container to run the job. The NodeManager reports the running state to the RM, and when the application finishes the RM reclaims the resources.
This is the framework that Hadoop 2 implements.
Version history:
The Nutch web crawler was built on top of Lucene.
2006: the Hadoop project was born.
2008: the large-scale data stress tests succeeded.
2009: version 0.20.0 was released.
Hadoop 1 is no longer maintained.
http://hadoop.apache.org
Hadoop 2 can run batch programs, interactive interfaces, HBase, streaming data processing, in-memory computing, and high-performance computing.
The interface to Hadoop is MapReduce: to use Hadoop, you write a MapReduce program (just as Redis has redis-cli and MongoDB has its own client).
Hive was developed by Facebook and can run on YARN and on Tez. It lets users express data processing as SQL statements; each SQL statement has to be processed by MapReduce, so Hive ends up calling MapReduce jobs and is a batch-processing system, and MapReduce is a slow process. Hive's language is called HQL; it is similar to SQL but still has to be learned in its own right. It lets users run tasks through a familiar interface: the interface to a relational database such as MySQL is SQL, and the interface to the Linux operating system is the shell.
Hive (the little bee) and Pig are the two brothers: if you want a simpler, more practical way to use MapReduce, you can use Pig or Hive. Pig provides a scripting-language interface.
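As a hedged illustration of the SQL-style interface, the sketch below runs one HQL statement from Java over JDBC. It assumes a HiveServer2 instance on localhost:10000, the Hive JDBC driver on the classpath, and a made-up table `words`; Hive translates the query into MapReduce (or Tez) jobs behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver (assumed to be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:10000/default";  // assumed HiveServer2 location
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // This HQL is executed by Hive as one or more batch jobs.
            ResultSet rs = stmt.executeQuery(
                    "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```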
HDFS stores streaming data and only supports appending to files or creating new ones.
HBase is a big table. Among storage models it belongs to the column-store camp: SQL databases are row stores, while HBase stores data by column. A table consists of rows and columns; each column is stored as a unit, and multiple columns form a column family. With row-stored data you have to traverse a row to find its data; data stored by column does not assume row-by-row access, so an access may only touch the few columns belonging to one column family.
Each cell can hold multiple coexisting versions, so version history can be retained and any particular version can be requested; HBase is like a time machine. Each column's cell is essentially key-value storage.
HBase ultimately works on top of HDFS: any data in HBase is stored as HDFS blocks, and when it is needed the data block is pulled out, new column data is appended, and it is written back to the block.
HBase has its own interface; although it is not SQL, it still provides operations to add, delete, modify, and query data. HBase needs its own cluster, and to avoid split-brain it requires ZooKeeper to keep the cluster operating properly. HDFS itself already provides redundancy.
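The sketch below illustrates that non-SQL add/query interface and the multi-version "time machine" cells using the HBase Java client. It assumes an existing table `t` with a column family `cf` whose VERSIONS setting keeps more than one version; all names are made up for illustration, and the cluster location comes from hbase-site.xml.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseVersionsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml (ZooKeeper quorum, etc.)
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("t"))) {

            // Write the same cell twice; HBase keeps both writes as separate
            // versions, as long as the column family's VERSIONS setting allows it.
            table.put(new Put(Bytes.toBytes("row1"))
                    .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("v1")));
            table.put(new Put(Bytes.toBytes("row1"))
                    .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("v2")));

            // Read back multiple versions of the cell: the "time machine" behaviour.
            Get get = new Get(Bytes.toBytes("row1"));
            get.setMaxVersions(3);
            Result result = table.get(get);
            List<Cell> cells = result.getColumnCells(Bytes.toBytes("cf"), Bytes.toBytes("col"));
            for (Cell cell : cells) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```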
This is the very famous Hadoop; the three tools above are plug-ins on top of it.

If you want to export data from a relational database such as MySQL and process it with MapReduce, you need a data extraction and import tool. This data-exchange tool works alongside Hadoop: it helps you extract data from relational databases and import it into Hadoop, and it can also extract structured data from Hadoop and import it back into a relational database. It is called Sqoop.
Flume = log collector, similar to Logstash. The collected logs are imported, for example into HBase, in a streaming manner. It is an independent log-collection system; the logs can be put into a file or into HDFS.
Besides Pig, Hive, and HBase there is also Oozie = a workflow tool (it arranges your MapReduce tasks into a workflow).
R connectors: R is a well-known programming language used for statistics and mathematics.
Mahout = machine learning; the next step is deep learning, working toward a large artificial-intelligence ecosystem.
https://hadoop.apache.org/old/
Avro™: a data serialization system.
Cassandra™: a scalable multi-master database with no single point of failure; a NoSQL database with no central node.
A log collection tool; and a fast and general real-time computation engine: Spark, which completes distributed computation in memory and whose interface is a programming language.
RT / stream / graph: real-time and streaming data processing.
ZooKeeper = coordination tool.
**Hadoop is best not run on virtual machines: its I/O requirements are very high, especially for HDFS. On Linux, each Hadoop component can be built into a Docker image. Ambari is a full-lifecycle management tool for Hadoop, and is itself distributed; it handles installation, management, and monitoring.**
Hadoop distributions:
Cloudera: CDH (Cloudera's Distribution including Apache Hadoop)
Hortonworks: HDP (Hortonworks Data Platform)
Intel: IDH (Intel Distribution for Apache Hadoop)
MapR

Deployment modes: stand-alone mode, pseudo-distributed mode, distributed mode.
