Basic information about Hadoop

concept

Hadoop is an open-source software framework under Apache, implemented in Java; it is a software platform for storing and computing on large-scale data.

framework content

narrow interpretation

core components

  • HDFS - distributed file system, solving massive data storage
  • MapReduce - distributed computing programming framework, solving massive data computation
  • YARN - job scheduling and cluster resource management framework, solving resource and task scheduling

broad interpretation

Hadoop ecosystem

  • HUE - Graphical user interface for operating and developing Hadoop applications.
  • Kafka - Big data message queue.
  • Oozie - An open source workflow scheduling engine for the Hadoop platform. It is used to manage Hadoop jobs.
  • Spark - Unified analytics engine for large-scale data processing. Analysis of real-time data or offline data.
  • Flink - Streaming analytics for real-time data.
  • Sqoop - Data migration tool; acts as an ETL tool.
  • Hive - Data warehouse that can be used for offline data analysis.
  • Zeppelin - An interactive development system that can perform visual analysis of big data, and can undertake tasks such as data access, data discovery, data analysis, data visualization, and data collaboration.
  • Drill - Low-latency distributed SQL query engine for Hadoop and NoSQL, supporting interactive queries over massive structured, semi-structured, and nested data.
  • Mahout - Provides some scalable implementations of classic algorithms in the field of machine learning, designed to help developers create smart applications more easily and quickly.
  • Tajo - a distributed data warehouse system, implemented based on Hadoop, characterized by low latency, high scalability, and provides dedicated query and ETL tools.
  • Avro - A data serialization system designed for applications that support high-volume data exchange.
  • Pig - A Hadoop-based platform for large-scale data analysis. The SQL-like language it provides is called Pig Latin, and its compiler converts SQL-like data analysis requests into a series of optimized MapReduce operations. Pig offers a simple operation and programming interface for complex, massive parallel data computation.
  • Impala - A new open-source MPP query engine built on top of Hadoop, providing low-latency, high-concurrency, read-oriented queries.
  • Tez - An open source computing framework that supports DAG jobs; it can merge multiple dependent jobs into one job to greatly improve the performance of DAG-style workloads.
  • Zookeeper - A distributed, open source coordination service for distributed applications.
  • Hbase - A highly reliable, high performance, column-oriented, scalable distributed storage system.
  • Cassandra - An open source distributed database management system, originally developed by Facebook, for storing extremely large amounts of data.
  • Redis - A high-performance key-value database.
  • Chukwa - An open source data collection system for monitoring large distributed systems.
  • Mesos - Open source distributed resource management framework, which is known as the kernel of distributed systems.
  • Yarn - Responsible for resource management and for task scheduling and monitoring in a Hadoop cluster.
  • MapReduce - A programming model for parallel operations on large-scale datasets (greater than 1TB).
  • Flume - A highly available, highly reliable, and distributed massive log collection, aggregation, and transmission system.
  • Hdfs - Distributed File System designed to run on commodity hardware.
  • Ambari - A web-based tool that supports provisioning, management, and monitoring of Apache Hadoop clusters.

Hadoop configuration file

  • core-site.xml (cluster-wide parameters) - defines system-level settings such as the HDFS URI and the Hadoop temporary directory.
  • hdfs-site.xml (HDFS parameters) - such as the storage locations of the NameNode and DataNodes, the number of file replicas, and file read permissions.
  • mapred-site.xml (MapReduce parameters) - includes JobHistory Server and application settings, such as the default number of reduce tasks and the default upper and lower memory limits for tasks.
  • yarn-site.xml (cluster resource management parameters) - configures ResourceManager and NodeManager communication ports, web monitoring ports, etc.
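
As a small illustration of how these files are consumed, a Hadoop client reads them through org.apache.hadoop.conf.Configuration. A minimal sketch, assuming the relevant *-site.xml files are on the classpath (otherwise the built-in defaults, or null, are returned):

import org.apache.hadoop.conf.Configuration;

public class ConfDemo {
    public static void main(String[] args) {
        // Configuration loads core-default.xml and core-site.xml from the classpath;
        // HDFS/MapReduce/YARN-specific *-site.xml files are picked up once those components are used.
        Configuration conf = new Configuration();
        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));
        System.out.println("hadoop.tmp.dir  = " + conf.get("hadoop.tmp.dir"));
        System.out.println("dfs.replication = " + conf.get("dfs.replication"));
    }
}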

Hadoop HDFS

overview

HDFS definition

HDFS (Hadoop Distributed File System) is, first of all, a file system, used to store files and locate them through a directory tree; secondly, it is distributed: many servers combine to realize its functions, and the servers in the cluster each have their own roles.

HDFS usage scenarios: suitable for write-once, read-many workloads; file modification is not supported. It is suitable for data analysis, but not suitable for network-disk applications.

advantage

  • High fault tolerance: multiple copies of data are saved automatically, and a lost replica can be recovered automatically.
  • Suitable for processing big data.
  • Can be built on cheap machines, where reliability is improved through the multi-replica mechanism.

shortcoming

  • Not suitable for low-latency data access.
  • It cannot efficiently store a large number of small files.
  • Concurrent writing and random modification of files are not supported.

HDFS composition architecture

NameNode

The master; it acts as the supervisor and manager.

  • Manage HDFS namespaces
  • Configure replica policy
  • Manage data block (Block) mapping information
  • Handle client read and write requests

DataNode

The slave; the NameNode issues commands, and the DataNode performs the actual operations.

  • store the actual data block
  • Perform read/write operations on data blocks

Client client

  • File splitting: when a file is uploaded to HDFS, the Client divides the file into blocks and uploads them
  • Interacts with the NameNode to obtain the location information of the file
  • Interacts with DataNodes to read or write data
  • The Client provides commands to manage HDFS, such as formatting the NameNode
  • The Client can access HDFS through commands, for example to add, delete, query, and modify data in HDFS

Secondary NameNode

It is not a hot standby for the NameNode. When the NameNode goes down, it cannot immediately replace the NameNode and provide services.

  • Assists the NameNode and shares part of its workload
  • In an emergency, it can help recover the NameNode

HDFS file blocks

overview

Files in a Hadoop cluster are stored in HDFS in the form of blocks.

Defaults

Starting from version 2.7.3, the default block size is 128 MB; in earlier versions the default was 64 MB.

How to modify the size of the block block?

You can modify the value corresponding to dfs.blocksize in the hdfs-site.xml file.

Note: when modifying the HDFS block size, first stop the running Hadoop cluster processes, then restart them after the modification.
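
Besides editing hdfs-site.xml, dfs.blocksize is also read on the client side, so a single application can override it for the files it writes. A hedged sketch (the path is illustrative, and the cluster-wide default still comes from hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Files written by this client will use 256 MB blocks
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/tmp/blocksize-demo.txt"))) {
            out.writeBytes("hello hdfs");
        }
        fs.close();
    }
}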

block block size setting rules

In practical applications, how large should an HDFS block be? Why are some 64 MB while others are 128 MB, 256 MB, or 512 MB?

First, let's understand a few concepts:

1) Seek time: the time it takes to locate the target block in HDFS.

2) Principle: the larger the block, the smaller the share of time spent seeking, but the longer each block takes to transfer from disk; the smaller the block, the larger the share of time spent seeking, but the shorter each transfer.

The block cannot be set too large, nor can it be set too small

  • If the block is set too large: on the one hand, the time to transfer the data from disk will be significantly longer than the seek time, so the program becomes very slow when processing the data; on the other hand, a map task in MapReduce usually processes only one block at a time, so if a block holds too much data, processing will be very slow.
  • If the block is set too small: on the one hand, storing a large number of small files consumes a large amount of NameNode memory for metadata, and NameNode memory is limited, so this is not advisable; on the other hand, more time is spent seeking the starting position of each block. Therefore, the block should be set reasonably large to reduce seek time, so that the time to transfer a file consisting of multiple blocks depends mainly on the disk transfer rate.

How appropriate is it?

1) The average seek time in HDFS is about 10 ms.

2) Extensive testing has shown that the best state is when the seek time is about 1% of the transfer time, so the target transfer time per block is:

10 ms / 0.01 = 1000 ms = 1 s

3) Current disk transfer rates are generally around 100 MB/s, so the optimal block size is:

100 MB/s × 1 s = 100 MB

So we set the block size to 128MB.

4) In practice, when the disk transfer rate is 200MB/s, the block size is generally set to 256MB; when the disk transfer rate is 400MB/s, the block size is generally set to 512MB.

Shell operation of HDFS

Usage: hadoop fs [generic options]
        [-appendToFile <localsrc> ... <dst>]
        [-cat [-ignoreCrc] <src> ...]
        [-checksum <src> ...]
        [-chgrp [-R] GROUP PATH...]
        [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
        [-chown [-R] [OWNER][:[GROUP]] PATH...]
        [-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
        [-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
        [-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
        [-createSnapshot <snapshotDir> [<snapshotName>]]
        [-deleteSnapshot <snapshotDir> <snapshotName>]
        [-df [-h] [<path> ...]]
        [-du [-s] [-h] [-v] [-x] <path> ...]
        [-expunge [-immediate]]
        [-find <path> ... <expression> ...]
        [-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-getfacl [-R] <path>]
        [-getfattr [-R] {-n name | -d} [-e en] <path>]
        [-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
        [-head <file>]
        [-help [cmd ...]]
        [-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
        [-mkdir [-p] <path> ...]
        [-moveFromLocal [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
        [-moveToLocal <src> <localdst>]
        [-mv <src> ... <dst>]
        [-put [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
        [-renameSnapshot <snapshotDir> <oldName> <newName>]
        [-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
        [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
        [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
        [-setfattr {-n name [-v value] | -x name} <path>]
        [-setrep [-R] [-w] <rep> <path> ...]
        [-stat [format] <path> ...]
        [-tail [-f] [-s <sleep interval>] <file>]
        [-test -[defswrz] <path>]
        [-text [-ignoreCrc] <src> ...]
        [-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
        [-touchz <path> ...]
        [-truncate [-w] <length> <path> ...]
        [-usage [cmd ...]]
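
The same operations are also available programmatically through the FileSystem Java API. A rough sketch mapping a few of the commands above to API calls (paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsShellDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/demo"));                                    // hadoop fs -mkdir -p /demo
        fs.copyFromLocalFile(new Path("data.txt"), new Path("/demo"));   // hadoop fs -put data.txt /demo
        for (FileStatus st : fs.listStatus(new Path("/demo"))) {         // hadoop fs -ls /demo
            System.out.println(st.getPath() + "\t" + st.getLen());
        }
        fs.delete(new Path("/demo/data.txt"), false);                    // hadoop fs -rm /demo/data.txt
        fs.close();
    }
}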

HDFS data flow

write data flow

1. The client requests the NameNode to upload files through the Distributed FileSystem module, and the NameNode checks whether the target file exists and whether the parent directory exists.

2. NameNode returns whether it can be uploaded

3. The client requests which DataNode servers the first Block should be uploaded to.

4. The NameNode returns three DataNode nodes, namely dn1, dn2, and dn3.

5. The client requests dn1 to upload data through the FSDataOutputStream module. After receiving the request, dn1 will call dn2, and dn2 will then call dn3, completing the establishment of the communication pipeline.

6. dn1, dn2, and dn3 respond to the client step by step.

7. The client starts to upload the first Block to dn1 (it first reads the data from disk into a local memory cache). Data is sent in units of Packets: dn1 passes each Packet to dn2, and dn2 passes it to dn3; every Packet that dn1 sends is also placed into an acknowledgement queue to wait for the response.

8. When one Block has been transmitted, the client requests the NameNode again for the DataNode servers to upload the second Block to (repeating steps 3-7).

Read data flow

1. The client requests the NameNode through the DistributedFileSystem module to obtain the locations of the first blocks, or the entire block list, of the file.

2. NameNode returns the Block list

3. Client Node reads data from the nearest DataNode.

4. The client node calls the read() method.

5. Find the DataNode closest to the ClientNode and connect to the DataNode to read

HDFS follows the principle of reading from the nearest node.
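
From the application's point of view, the block-location lookup and nearest-replica reading are hidden behind FileSystem.open(). A minimal read sketch (the path is illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // DistributedFileSystem for hdfs:// URIs
        try (FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}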

MapReduce of Hadoop

overview

definition

MapReduce is a programming framework for distributed computing programs and the core framework for users to develop "data analysis applications based on Hadoop".

The core function of MapReduce is to integrate the business logic code written by users and its own default components into a complete distributed computing program, which runs concurrently in a Hadoop cluster.

Notice

  • Not good at real-time computing
  • Not good at streaming computing
  • Not good at DAG (directed acyclic graph) computation

MapReduce core programming ideas

1) The MapReduce operation program generally needs to be divided into two stages: the Map stage and the Reduce stage.

2) The concurrent MapTask in the Map stage runs completely in parallel and is independent of each other

3) The concurrent ReduceTasks in the Reduce stage are completely irrelevant to each other, but their data depends on the output of all MapTask concurrent instances in the previous stage

4) The MapReduce programming model can contain only one Map stage and one Reduce stage. If the user's business logic is very complex, the only option is to run multiple MapReduce programs serially.

MapReduce process

A complete MapReduce program has three types of instance processes during distributed runtime

  • MrAppMaster is responsible for the process scheduling and status coordination of the entire program
  • MapTask is responsible for the entire data processing process in the Map phase
  • ReduceTask is responsible for the entire data processing process of the Reduce phase

MapReduce programming specification

Mapper stage

1. The user-defined Mapper should inherit its own parent class

2. The input data of the Mapper is in the form of KV pairs (the KV types can be customized)

3. The business logic in Mapper is written in the map() method

4. The output data of Mapper is in the form of KV pairs (the type of KV can be customized)

5. The map() method (MapTask process) is called once for each <K, V>
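
A minimal Mapper sketch that follows these rules, using the classic WordCount example (the input key is the line's byte offset and the value is the line text):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text word = new Text();
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // called once per input line; emit <word, 1> for every token
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, one);
            }
        }
    }
}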

Reducer stage

1. The user-defined Reducer should inherit its own parent class

2. The input data type of Reducer corresponds to the output data type of Mapper, which is also KV

3. The business logic of the Reducer is written in the reduce() method

4. The ReduceTask process calls the reduce() method once for each <K, V> group with the same K
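
A matching Reducer sketch for the WordCount example; reduce() is called once per distinct key with all of that key's values:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get(); // add up the partial counts for this word
        }
        result.set(sum);
        context.write(key, result); // one output record per distinct key
    }
}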

Driver stage

Equivalent to the client of the YARN cluster. It is used to submit the entire program to the YARN cluster; what is submitted is a Job object that encapsulates the running parameters of the MapReduce program.
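
A minimal Driver sketch for the WordCount job above (input and output paths come from the command line; the class names match the Mapper and Reducer sketches in the previous sections):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");      // the Job object that gets submitted
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);   // submit to the cluster and wait
    }
}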

Hadoop serialization

Serialization is to convert objects in memory into byte sequences (or other data transfer protocols) for storage to disk (persistence) and network transmission.

Deserialization is to convert received byte sequences (or other data transfer protocols) or persistent data on disk into objects in memory.

Hadoop has developed a serialization mechanism (Writable) by itself.

  • Compact, efficient use of storage space
  • Fast, with little overhead for reading and writing data
  • Scalable, can be upgraded with the upgrade of the communication protocol
  • Interoperability, support multilingual interaction
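
A minimal sketch of a custom bean that implements the Writable interface (the FlowBean name and fields mirror the traffic-statistics example used later in this article; the sort example there additionally implements comparison):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class FlowBean implements Writable {

    private long upFlow;
    private long downFlow;
    private long sumFlow;

    public FlowBean() {
        // an empty constructor is required so the framework can create instances via reflection
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // serialization: the field order here must match readFields()
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // deserialization: read the fields back in the same order
        upFlow = in.readLong();
        downFlow = in.readLong();
        sumFlow = in.readLong();
    }
}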

Principle of MapReduce framework

Data slicing and MapTask parallelism determination mechanism

  • The parallelism of the Map stage of a job is determined by the number of slices when the client submits the job
  • Each Split slice is assigned a MapTask parallel instance processing
  • By default, slice size = BlockSize
  • When slicing, the entire dataset is not considered as a whole; instead, each file is sliced individually

FileInputFormat slice source code analysis

1. The program first finds the data storage directory.
2. It iterates over every file under the directory.
3. It processes the first file, ss.txt:
    a. Get the file size: fs.sizeOf(ss.txt)
    b. Compute the split size:
        computeSplitSize(Math.max(minSize, Math.min(maxSize, blockSize))) = blockSize = 128M
    c. By default, split size = blockSize
    d. Start splitting, forming the first split:  ss.txt 0-128M
                       the second split: 128M-256M
                       the third split: 256M-300M
        (Each time a split is cut, check whether the remaining part is more than 1.1 times the block size; if it is not, it goes into a single split.)
    e. Write the split information to a split plan file.
    f. The core of the whole splitting process is completed in the getSplits() method.
    g. An InputSplit records only the metadata of the split, such as the start offset, the length, and the list of nodes where it resides.
4. Submit the split plan file to YARN; the MrAppMaster on YARN can then calculate the number of MapTasks to launch based on the split plan file.

CombineTextInputFormat slicing mechanism
The framework's default TextInputFormat slicing mechanism slices tasks by file: no matter how small a file is, it becomes a separate split and is handed to its own MapTask. If there are a large number of small files, a large number of MapTasks are created, which is extremely inefficient.

1. Application scenario
CombineTextInputFormat is used when there are too many small files. It can logically group multiple small files into one split,
so that many small files can be handed to a single MapTask.

2. Setting the maximum virtual-storage split size
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB
Note: the maximum virtual-storage split size is best set according to the actual sizes of the small files.

KeyValueTextInputFormat
Each line is a record, split into key and value by a separator. The separator can be set in the driver class with conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, "\t"); the default separator is tab (\t).

NLineInputFormat
If NLineInputFormat is used, the InputSplit processed by each map process is no longer divided by blocks, but by the number of lines N specified by NLineInputFormat. That is, the number of splits = total number of lines in the input file / N; if this does not divide evenly, the number of splits = quotient + 1.

// Set three records per InputSplit
NLineInputFormat.setNumLinesPerSplit(job, 3);

Shuffle mechanism

After the Map method, the data processing process before the Reduce method is called Shuffle.

Partition

Partitioning is used when statistical results need to be output to different files (partitions) according to some condition.

custom partition

Steps to define a custom Partitioner:
1) Create a class that extends Partitioner and override the getPartition() method
public class CustomPartitioner extends Partitioner<Text, FlowBean> {
    public int getPartition(Text key, FlowBean value, int numPartitions) {
        // partitioning logic

        return partition;
    }
}

2) In the Job driver, set the custom Partitioner
job.setPartitionerClass(CustomPartitioner.class);

3) After defining a custom Partitioner, set the number of ReduceTasks to match the Partitioner's logic
job.setNumReduceTasks(5);

Partition summary

  • If the number of ReduceTasks > the number of results of getPartition(), several empty output files part-r-000xx are generated;
  • If 1 < the number of ReduceTasks < the number of results of getPartition(), some partition data has nowhere to go and an Exception is thrown;
  • If the number of ReduceTasks = 1, no matter how many partition files the MapTask side outputs, they are all handed to this single ReduceTask, and only one result file part-r-00000 is produced;
  • Partition numbers must start from zero and increase one by one.

to sort

Both MapTask and ReduceTask sort data by key. This is the default behavior of Hadoop: data in any application is sorted, regardless of whether the logic requires it.

The default sort order is lexicographic, and it is implemented with quicksort.

Sort by category

1) Partial sorting: MapReduce sorts the dataset based on the keys of the input records. Ensure that each output file is internally ordered.

2) Full sorting: the final result is only one file, and the file is in order. The way to achieve it is to set only one ReduceTask. But this method is extremely inefficient when dealing with large files, because one machine processes all files, completely losing the parallel architecture provided by MapReduce.

3) Auxiliary (grouping) sort: group keys on the Reduce side. Applicable when the received key is a bean object: use grouping sort when you want keys whose one or several fields are equal (even though the complete keys differ) to enter the same reduce() method.

4) Secondary sorting: In the custom sorting process, if the judgment condition in compareTo is two, it is secondary sorting.

5) Custom sort: the bean needs to implement the WritableComparable interface and override the compareTo() method to implement the ordering.

example

public class FlowBean implements WritableComparable<FlowBean> {

    private long sumFlow; // total traffic, used as the sort key

    public long getSumFlow() {
        return sumFlow;
    }

    @Override
    public int compareTo(FlowBean bean) {
        // descending order by sumFlow
        int result;
        if (sumFlow > bean.getSumFlow()) {
            result = -1;
        } else if (sumFlow < bean.getSumFlow()) {
            result = 1;
        } else {
            result = 0;
        }
        return result;
    }

    // write() and readFields() from the Writable contract are omitted here
}

Combiner

(1) Combiner is a component other than Mapper and Reducer in the MR program.

(2) The parent class of the Combiner component is the Reducer.

(3) The difference between Combiner and Reducer is the location of operation:

Combiner runs on the node where each MapTask is located;

the Reducer receives the output of all Mappers globally;

(4) The significance of the Combiner is to locally summarize the output of each MapTask to reduce the amount of network transmission.

(5) The premise that Combiner can be applied is that it cannot affect the final business logic, and the output kv of Combiner should correspond to the input kv type of Reducer.
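
As a hedged illustration, assuming a sum-style job like the WordCount sketch above (where merging partial counts does not change the final result), the Combiner is typically enabled in the Driver by reusing the Reducer class:

// In the Driver: reuse the Reducer as the Combiner for local aggregation on each MapTask node
job.setCombinerClass(WordCountReducer.class);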

group sort

1. The custom class extends WritableComparator

2. Rewrite the compare method

3. Create a constructor that passes the class of the objects being compared to the parent class
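
A sketch of these three steps (OrderBean and getOrderId() are illustrative names, not taken from the original text):

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class OrderGroupingComparator extends WritableComparator {

    protected OrderGroupingComparator() {
        // pass the class of the objects being compared to the parent class;
        // true tells the parent to create OrderBean instances for comparison
        super(OrderBean.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        OrderBean o1 = (OrderBean) a;
        OrderBean o2 = (OrderBean) b;
        // keys with the same orderId are treated as one group and enter the same reduce() call
        return Long.compare(o1.getOrderId(), o2.getOrderId());
    }
}

It is then registered in the Driver with job.setGroupingComparatorClass(OrderGroupingComparator.class).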

Reduce Join

The main work of the Map side: label the key/value pairs from different tables or files to distinguish records from different sources. Then use the connection field as the key, the rest and the newly added flag as the value, and finally output.

The main work of the Reduce side: by the time data reaches the Reduce side, grouping by the join field as the key has already been done. We only need to separate, within each group, the records that come from different files (using the flag added in the Map phase), and then merge them.

Map Join

Map Join is suitable for scenarios where one table is very small and one table is very large.
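
A rough sketch of a map-side join, assuming the Driver has registered the small table with job.addCacheFile(new URI("/cache/small_table.txt")) and set the job to map-only with job.setNumReduceTasks(0); the file path and the tab-separated joinKey/attribute layout are illustrative assumptions:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Map<String, String> smallTable = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // load the cached small table into memory once per MapTask
        URI[] cacheFiles = context.getCacheFiles();
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(cacheFiles[0]))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t"); // assumed layout: joinKey \t attribute
                smallTable.put(fields[0], fields[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // join each big-table record against the in-memory small table; no Reduce phase needed
        String[] fields = value.toString().split("\t"); // assumed layout: joinKey \t ...
        String joined = value.toString() + "\t" + smallTable.getOrDefault(fields[0], "");
        context.write(new Text(joined), NullWritable.get());
    }
}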

Development summary

Input data interface InputFormat

1. The default implementation class is TextInputFormat.

2. The logic of TextInputFormat:
    it reads one line of text at a time, then returns the starting offset of the line as the key and the line content as the value.

3. KeyValueTextInputFormat: each line is a record, split by a separator into key and value. The default separator is tab.

4. NLineInputFormat: splits are divided by the specified number of lines N.

5. CombineTextInputFormat: can merge multiple small files into one split for processing, improving efficiency.

6. Users can also implement a custom InputFormat.

Logic processing interface Mapper

Users implement three of its methods according to business needs:
    map(), setup(), cleanup()

Partitioner partition

The default implementation is HashPartitioner; its logic is to return a partition number based on the hash of the key and numReduces.

Users can also define their own partitioning.

Comparable sort

1. When we use a custom object as the output key, it must implement the WritableComparable interface and override its compareTo() method.
2. Partial sort: sort each final output file internally.
3. Full sort: sort all the data; there is usually only one Reduce.
4. Secondary sort: there are two sorting conditions.

Combiner

Combiner merging can improve program efficiency and reduce IO transfer, but it must not affect the original business result.

Reduce side grouping GroupingComparator

Group keys on the Reduce side. Applicable when the received key is a bean object: use grouping sort when you want keys whose one or several fields are equal (even though the complete keys differ) to enter the same reduce() method.

Logic processing interface Reducer

Users implement reduce(), setup(), and cleanup() according to business needs.

Output data interface OutputFormat

1. The default implementation class is TextOutputFormat; its logic is to write each KV pair as one line to the target text file.

2. SequenceFileOutputFormat output can be used as the input of a subsequent MapReduce job.

3. Users can also define a custom OutputFormat.

YARN for Hadoop

overview

YARN is a resource scheduling platform responsible for providing server computing resources to computing programs. It is equivalent to a distributed operating system platform, while computing programs such as MapReduce are like applications running on top of that operating system.

Source: blog.csdn.net/flash_love/article/details/131809496