Hadoop in the Internet of Things Architecture

1. What is big data

    Big data refers to collections of data that cannot be captured, managed, and processed with conventional software tools within a reasonable period of time; it is a massive, fast-growing, and diversified information asset.
The salient features of big data are as follows:
    (1) The data volume is huge, reaching the PB or even EB level.
    (2) The data types are varied, and most of the data is unstructured.
    (3) The value density is low: valuable data accounts for only a small part of the total.
    (4) Data is generated, and must be processed, at high speed, which requires technologists to be able to rapidly extract valuable information from many kinds of data.

2. Overview of Hadoop

    Hadoop is open source software that implements a distributed file system (Hadoop Distributed File System, HDFS). A distributed system is a software system that runs on multiple hosts.
HDFS has the following characteristics:
    High fault tolerance: it automatically keeps multiple copies of the data and automatically reassigns failed tasks.
    Low cost: it can be deployed as a cluster on inexpensive, general-purpose hardware.
    High scalability: new nodes can be added to the running cluster to expand its capacity.
    High efficiency: data is processed in parallel on the nodes where it is stored, which gives high aggregate throughput (a minimal client sketch of using HDFS follows this list).
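    To make these characteristics concrete from a client's point of view, the following is a minimal sketch, not taken from the original article, of copying a local file into HDFS with the Java FileSystem API. The NameNode address hdfs://master:9000, the replication factor, and the file paths are assumptions made only for the example.

```java
// Minimal HDFS client sketch: copy a local file into the distributed file system.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // HDFS keeps several replicas of every block for fault tolerance;
        // 3 is the usual default and is set here only to make that explicit.
        conf.set("dfs.replication", "3");
        // Assumed NameNode address for this sketch.
        FileSystem fs = FileSystem.get(URI.create("hdfs://master:9000"), conf);
        // Copy a local file into HDFS; the cluster splits it into blocks
        // and distributes the replicas across DataNodes.
        fs.copyFromLocalFile(new Path("/tmp/local.txt"),
                             new Path("/user/hadoop/input/local.txt"));
        fs.close();
    }
}
```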

3. Hadoop Architecture

(1) The core components of Hadoop

    HDFS and MapReduce are the two cores of Hadoop. HDFS provides the underlying support for distribution, enabling high-speed parallel reading and writing as well as large-capacity storage expansion. MapReduce provides support for distributed parallel task processing, ensuring high-speed analysis and processing of data. During MapReduce task processing, HDFS supplies file storage and access; on top of HDFS, MapReduce distributes, tracks, and executes tasks and collects the results. Through this interaction the two components accomplish the main tasks of a Hadoop distributed cluster.
    Why MapReduce? Large amounts of data are usually processed with parallel computing, which requires the ability to decompose a large, complex computing problem into subtasks and assign them to multiple computing resources that run at the same time; its notable advantage is that it takes less time than computing on a single resource. For most developers, however, parallel computing is still unfamiliar and complicated, and distributed processing makes it even harder. MapReduce is a programming model for parallel computing: it provides an interface to users and hides many of the details of parallel computing, especially of distributed processing, so that developers without much parallel-computing experience can easily build parallel applications. Here, parallel computing means that multiple servers (a distributed system) work at the same time, reading and writing data in blocks simultaneously. MapReduce combines two concepts: map (mapping) and reduce (reduction). Map is responsible for decomposing a task into multiple subtasks, and reduce is responsible for summarizing the processing results of those subtasks, as the word-count sketch below illustrates.
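    The following is a minimal word-count sketch of the two phases, assuming the classic org.apache.hadoop.mapreduce API; it is an illustration added here, not code from the original article, and the class names TokenMapper and SumReducer are made up for the example.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: decompose the input into independent partial results (word, 1).
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}

// reduce: summarize the partial results that share the same key.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // emit (word, total count)
    }
}
```

    The map phase turns every line into (word, 1) pairs and the reduce phase sums the counts for each word, which is exactly the decompose-then-summarize pattern described above.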

4. Overview of MapReduce

    In Hadoop, parallel applications are developed on the basis of the MapReduce programming model, which allows tasks to be distributed across clusters consisting of thousands of commodity machines and thus realizes Hadoop's parallel task processing capability.

5. MapReduce framework design

    So how does a MapReduce program run? A finished MapReduce program is configured as a MapReduce job (Job). A "job" here can be understood as follows: after a MapReduce program has been written for a distributed computing task, the program is submitted to the MapReduce execution framework, which executes it. Once the client submits the Job to the JobTracker, the job passes through the following five stages from input to final output:
(1) input: The JobTracker creates the Job and computes the input splits (Input Split) from the Job's input. The data set to be processed must be decomposed into many small data sets, each of which can be processed completely in parallel. The number of files in the input directory determines the number of splits, and a single file that exceeds the default HDFS block size (64 MB) is further split by block size.
(2) split: The job scheduler obtains the Job's input split information and parses the records in each split into key-value pairs according to certain rules: the key is the starting position of each line, expressed in bytes, and the value is the text content of that line. Finally, one MapTask is created for each split and assigned to a TaskTracker.
(3) map: The TaskTracker starts the MapTask, which processes each input key-value pair; how a pair is handled is determined by the program code for this stage. When processing is complete, new key-value pairs are produced and stored locally.
(4) shuffle: The output of the MapTasks is transferred so that it can be used efficiently as the input of the ReduceTasks. During this process, data is exchanged between TaskTracker nodes and grouped by key.
(5) reduce: The ReduceTask reads the output of the shuffle stage and processes each input key-value pair; again, how a pair is handled is determined by the program code for this stage. Finally, the result is written out as the job's output. In Hadoop, every MapReduce computing task is initialized as a Job with two main processing stages, the map stage and the reduce stage, and both stages take key-value pairs as input and produce key-value pairs as output. The driver sketch below shows how these stages are wired together when a Job is submitted.
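    The following driver is a minimal sketch, assuming the hypothetical TokenMapper and SumReducer classes from the earlier word-count sketch and HDFS input/output paths chosen only for illustration; the comments map each call to the stages listed above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // input/split: the input path is divided into InputSplits, one MapTask per split.
        FileInputFormat.addInputPath(job, new Path("/user/hadoop/input"));

        // map: each split is processed by the Mapper into intermediate key-value pairs.
        job.setMapperClass(TokenMapper.class);

        // shuffle: the framework groups the intermediate pairs by key and moves
        // them to the nodes that run the ReduceTasks (no user code required here).

        // reduce: grouped pairs are summarized and written to the output path.
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/output"));

        // Submit the Job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```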
    To run MapReduce on YARN, the following configuration files need to be edited (a programmatic sketch of these settings is given after the list):
(1) yarn-env.sh: add the JDK path.
(2) mapred-site.xml: set mapreduce.framework.name to yarn.
(3) yarn-site.xml: YARN-specific configuration information, such as the ResourceManager address.
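    For reference, the sketch below sets the same properties programmatically with the Hadoop Configuration API; in practice they normally live in mapred-site.xml and yarn-site.xml rather than in code, and the ResourceManager hostname "master" is an assumption made for the example.

```java
import org.apache.hadoop.conf.Configuration;

public class YarnConfExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Equivalent of mapred-site.xml: run MapReduce jobs on YARN.
        conf.set("mapreduce.framework.name", "yarn");
        // Equivalent of yarn-site.xml: where the ResourceManager runs (assumed hostname).
        conf.set("yarn.resourcemanager.hostname", "master");
        // Print the effective setting to confirm it was applied.
        System.out.println(conf.get("mapreduce.framework.name"));
    }
}
```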
    A job submitted to the YARN framework then runs through the following flow (a client-side sketch of the first two steps is given after the list):
(1) The MapReduce framework receives the job submitted by the user, assigns it a new application ID, packages the application definition and uploads it to the user's application cache directory on HDFS, and then submits the application to the ApplicationsManager in the ResourceManager.
(2) The ApplicationsManager negotiates with the scheduler to obtain the first resource container, which is needed to run the ApplicationMaster.
(3) The ApplicationsManager launches the ApplicationMaster in the obtained resource container.
(4) The ApplicationMaster calculates the resources required by the application and sends resource requests to the scheduler.
(5) The scheduler allocates appropriate resource containers to the ApplicationMaster according to its own statistics on available resources and the ApplicationMaster's resource requests.
(6) The ApplicationMaster communicates with the NodeManager of each allocated container and hands over the job description and resource usage specification.
(7) The NodeManager starts the container and runs the task.
(8) The ApplicationMaster monitors the execution of the tasks in the containers.
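    The sketch below shows what steps (1) and (2) look like from the client side, using the public org.apache.hadoop.yarn.client.api.YarnClient API against a running YARN cluster. It is only an illustration of the protocol: the application name, the placeholder command "sleep 30", and the resource sizes are assumptions, and a real client would package and upload an actual ApplicationMaster instead.

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Step (1): ask the ResourceManager for a new application ID.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
        ApplicationId appId = context.getApplicationId();

        // Describe the container that will run the ApplicationMaster
        // (placeholder command instead of a real ApplicationMaster).
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                Collections.emptyMap(), Collections.emptyMap(),
                Collections.singletonList("sleep 30"), null, null, null);
        context.setApplicationName("yarn-flow-sketch");
        context.setAMContainerSpec(amContainer);
        context.setResource(Resource.newInstance(256, 1));

        // Step (2): submit the application; the ApplicationsManager and the
        // scheduler then allocate the first container for the ApplicationMaster.
        yarnClient.submitApplication(context);
        System.out.println("Submitted application " + appId);

        yarnClient.stop();
    }
}
```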
Hadoop has three modes of operation: stand-alone, pseudo-distributed, and fully distributed:
    Stand-alone mode: no configuration is required; Hadoop runs as a single Java process in non-distributed mode.
    Pseudo-distributed: a cluster with only one node, which acts as both the Master (master node, master server) and the Slave (slave node, slave server). Separate Java processes are used to simulate the various kinds of distributed nodes on this single node.
    Fully distributed: different subsystems of Hadoop divide the nodes differently. From the perspective of HDFS, the nodes are divided into the NameNode (manager), of which there is only one, and DataNodes (workers), of which there can be many. From the perspective of MapReduce, the nodes are divided into the JobTracker (job scheduler), of which there is only one, and TaskTrackers (task executors), of which there can be many. The NameNode and JobTracker can be deployed on different machines or on the same machine; the machines running the NameNode and JobTracker are the Master, and the remaining machines are Slaves.
Neither stand-alone mode nor pseudo-distributed mode reflects the advantages of cloud computing; they are usually used for program testing and debugging.
