What technical support does Hadoop need?

Hadoop is an open-source software framework that can be installed on a cluster of commodity machines, enabling those machines to communicate and work together to store and process large amounts of data in a highly distributed fashion. Originally, Hadoop consisted of two main components: the Hadoop Distributed File System (HDFS) and a distributed computing engine that let programs be implemented and run as MapReduce jobs.
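
For concreteness, here is a minimal sketch of reading and writing a file in HDFS from Java through the standard org.apache.hadoop.fs.FileSystem API. The class name HdfsExample and the path /tmp/hello.txt are invented for illustration, and the snippet assumes fs.defaultFS is already configured in core-site.xml on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml / hdfs-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path, used only for illustration.
        Path file = new Path("/tmp/hello.txt");

        // Write a small file into HDFS; its blocks are replicated across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back through the same API.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}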

Hadoop also provides the software infrastructure to run MapReduce jobs as a series of map and reduce tasks. Map tasks call the map function on subsets of the input data. When these calls complete, reduce tasks begin calling the reduce function on the intermediate data produced by the map function to generate the final output. Map and reduce tasks run independently of each other, which enables parallel, fault-tolerant computation.
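
To make the map and reduce sides concrete, the following word-count sketch uses the standard org.apache.hadoop.mapreduce API; the class names WordCountMapper and WordCountReducer are illustrative choices for this article, not names prescribed by Hadoop.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: the framework calls map() once per input record (here, one line of text).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit intermediate (word, 1) pairs
            }
        }
    }
}

// Reduce task: called once per distinct intermediate key with all of its values.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // emit final (word, count) pair
    }
}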

Most importantly, the Hadoop infrastructure handles all the complex aspects of distributed processing: parallelization, scheduling, resource management, inter-machine communication, handling of software and hardware failures, and more. Thanks to this clean abstraction, implementing distributed applications that process terabytes of data across hundreds (or even thousands) of machines has never been easier, even for developers with no prior experience with distributed systems.
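
A matching driver sketch illustrates how little glue code sits on top of the map and reduce functions themselves; the class name WordCountDriver and the two command-line arguments (input and output paths) are assumptions made for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // The framework handles input splitting, scheduling, retries and data
        // movement; the driver only declares what to run and where the data lives.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations are supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, such a job is typically launched with the hadoop jar command, and the framework distributes the code to the nodes that hold the data.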



 

Figure: MapReduce process diagram


Shuffle and Combine

The overall Shuffle process consists of the map-side shuffle, the sort stage, and the reduce-side shuffle. In other words, Shuffle spans both the map and reduce sides, with the sort phase in between: it is the process that carries data from a map task's output to a reduce task's input.

Sorting and combining happen on the map side. The combiner is essentially an early, map-local reduce, and it has to be configured explicitly by the developer, as shown in the sketch below.
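
As a hedged illustration, reusing the hypothetical WordCountReducer from the sketches above: because summing counts is associative and commutative, the same reducer class can double as the combiner, enabled with a single extra line in the driver.

// In the driver, before submitting the job: run a map-local "mini reduce" on each
// map task's output so that less data has to cross the network during the shuffle.
// Only safe when the reduce logic is associative and commutative
// (summing counts is; computing an average directly is not).
job.setCombinerClass(WordCountReducer.class);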

In a Hadoop cluster, map tasks and their corresponding reduce tasks usually run on different nodes, so when a reduce task executes it often has to pull map task results from other nodes across the network. If many jobs are running in the cluster, this traffic can consume a large share of the cluster's internal network resources. Some of that consumption is unavoidable; the goal is to minimize the part that is unnecessary. Within a node, disk I/O (compared with memory) also has a considerable impact on job completion time. Starting from these basic requirements, the tuning targets for the Shuffle phase of a MapReduce job can be stated as follows:

Pull the data from the map side to the reduce side completely and correctly.

Minimize unnecessary consumption of bandwidth when pulling data across nodes.

Reduce the impact of disk I/O on task execution.

Generally speaking, optimizing the Shuffle process mainly means reducing the amount of data that has to be pulled and using memory instead of disk wherever possible.
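
A small sketch of standard MapReduce configuration knobs aimed at exactly those two goals follows. The property names are stock Hadoop 2.x ones; the concrete values are arbitrary examples rather than recommendations, and Snappy compression additionally assumes the native library is available on the nodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuningExample {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();

        // Shrink what is pulled across the network: compress map output
        // (and, where the logic allows it, add a combiner in the driver).
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        // Favor memory over disk on the map side: a larger in-memory sort
        // buffer and a higher spill threshold mean fewer spill files.
        conf.setInt("mapreduce.task.io.sort.mb", 256);
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);

        // Reduce side: more parallel fetchers, and more heap devoted to holding
        // fetched map output in memory before it is merged.
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.80f);

        return Job.getInstance(conf, "shuffle-tuned job");
    }
}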

YARN

Compared with the classic MapReduce runtime, YARN makes the following substitutions:

ResourceManager replaces the cluster manager

ApplicationMaster replaces a dedicated and ephemeral JobTracker

NodeManager replaces TaskTracker

A distributed application replaces a MapReduce job

A global ResourceManager runs as the master background process, usually on a dedicated machine, and arbitrates the available cluster resources among the competing applications.

When a user submits an application, a lightweight process instance called the ApplicationMaster is started to coordinate the execution of all tasks within that application. This includes monitoring tasks, restarting failed tasks, speculatively running slow tasks, and aggregating application counter values. Interestingly, an ApplicationMaster can run any kind of task inside a container.

NodeManager is a more general and efficient version of TaskTracker. Instead of a fixed number of map and reduce slots, NodeManager has many dynamically created resource containers.
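
Because container sizes are requested per job rather than fixed by slots, per-task resources become ordinary configuration. A small illustrative sketch with arbitrary example values (not recommendations):

import org.apache.hadoop.conf.Configuration;

public class YarnContainerSizingExample {
    public static Configuration buildConf() {
        Configuration conf = new Configuration();

        // Each map or reduce task runs inside a YARN container whose size is
        // requested per job instead of coming from fixed map/reduce slots.
        conf.setInt("mapreduce.map.memory.mb", 2048);      // container memory for map tasks
        conf.setInt("mapreduce.reduce.memory.mb", 4096);   // container memory for reduce tasks
        conf.setInt("mapreduce.map.cpu.vcores", 1);
        conf.setInt("mapreduce.reduce.cpu.vcores", 2);

        // The task JVM heap must fit inside the container it runs in.
        conf.set("mapreduce.map.java.opts", "-Xmx1638m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");

        return conf;
    }
}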


 
   

Vendors of big data Hadoop distributions include Amazon Web Services, Cloudera, Hortonworks, IBM, MapR Technologies, Huawei, and DakuaiSearch. These vendors all build on the Apache open-source project and add packaging, support, integration, and innovations of their own.

Dakuai's general-purpose big data computing platform (DKH) integrates all components of the development framework under a single version number. If the Dakuai development framework is deployed on top of the open-source big data stack, the platform needs to support the following components:

Data source and SQL engine: DK.Hadoop, Spark, Hive, Sqoop, Flume, Kafka

Data collection: DK.Hadoop

Data processing module: DK.Hadoop, Spark, Storm, Hive

Machine learning and AI: DK.Hadoop, Spark

NLP module: upload the server-side JAR package; supported directly

Search engine module: not released independently
