Docker and Hadoop

Docker


How hot is Docker? In open source, I would put it alongside Spark as the hottest technology around; it has even pulled the Go language up with it, lifting Go in the TIOBE index from the back of the top 100 into the ranks of mainstream languages.

Docker is practically being hailed as a savior. With such a powerful technology at hand, what sparks fly when Docker collides with Hadoop? Do you need to rush to adopt it?

I won't introduce what Docker is. It is not a brand-new technology: it is a container engine built on LXC, the lightweight isolation mechanism that grew out of the Linux kernel. Compared with bare isolation, its core contribution is standardizing how images are packaged, deployed, and released, which in effect standardizes the development workflow. At runtime, its core advantage over a VM is that it is lightweight; the drawback is just as obvious: isolation is weaker and easier to break out of. The figure below compares a VM with a container:

[Figure: side-by-side comparison of the VM and container stacks]
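To make the "standardized packaging" point concrete, here is a minimal, hypothetical Dockerfile; the base image, installed package, and application artifact are all placeholders, but the declarative recipe itself is the standardized, reproducible unit of packaging and release that Docker introduced:

```dockerfile
# Hypothetical Dockerfile: every image is built from a declarative
# recipe like this, which is what standardizes packaging and release.
FROM ubuntu:16.04

# Install the application's runtime dependency (placeholder package).
RUN apt-get update && apt-get install -y openjdk-8-jre-headless

# Copy the application into the image (placeholder artifact that
# would need to exist in the build context).
COPY app.jar /opt/app/app.jar

# The command the container runs on start.
CMD ["java", "-jar", "/opt/app/app.jar"]
```

Building the image (`docker build -t myorg/myapp .`) then yields an artifact that runs the same way on any Docker host, which is exactly the consistency argument made above.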

The use of Docker in big data


As for LXC, Google's large-scale cluster management system Borg is said to have used it for a decade. Its target scenario is exactly big data: both batch and real-time workloads are claimed to be well supported, and cluster resource utilization is very high. Seen from that lineage, big data and Docker have deep roots.

The reality, however, is that Docker has not done particularly well in the Hadoop world. There are currently two mainstream ways of using it:


The first is to run Hadoop directly inside Docker. For example, Hortonworks acquired a company called SequenceIQ, whose Cloudbreak technology packages the Hortonworks Data Platform (HDP) into a Docker image. The advantage is that HDP can then be started on any mainstream cloud platform, such as Microsoft Azure, Amazon AWS, or Google Cloud Platform, which solves the problem of deploying across multiple clouds. But there has been no further news since the acquisition; the last update on GitHub was five months ago.
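For a taste of this first approach, SequenceIQ also published a single-node Hadoop image. The commands below follow its public README of the time (the image tag and bootstrap path are as published then and may since have changed):

```bash
# Pull and start a single-node Hadoop "cluster" in one container
docker run -it sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash

# Inside the container, smoke-test with a bundled MapReduce example
cd $HADOOP_PREFIX
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar pi 2 10
```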

At best this solves the development-environment problem. Without environment-specific tuning, it is hard for Hadoop to perform consistently across environments, so the applicable scenarios, and the value, are limited.


The second is to deploy applications in Docker containers through YARN, which has built-in Docker support (the DockerContainerExecutor). For details, see:

http://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/DockerContainerExecutor.html
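In outline, the setup described there switches the NodeManager over to the Docker-based executor in yarn-site.xml (property names as in the Hadoop 2.7.x documentation):

```xml
<!-- yarn-site.xml: launch YARN containers through Docker -->
<property>
  <name>yarn.nodemanager.docker-container-executor.exec-name</name>
  <!-- name or path of the Docker client binary on each NodeManager -->
  <value>/usr/bin/docker</value>
</property>
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.DockerContainerExecutor</value>
</property>
```

Individual jobs then choose the image their containers run in by setting yarn.nodemanager.docker-container-executor.image-name in their environment, as in the documentation's examples.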

As a resource manager, YARN has long been confined to the big data field because of its limited generality. Its FAIR scheduler is already enough to drive task-level resource utilization quite high, whereas stronger isolation, which is what Docker adds, actually restricts the elastic sharing of resources.
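For reference, Fair Scheduler pooling is configured with an allocations file like the following sketch (queue names and numbers are made up): capacity left idle by one queue can be borrowed by another, which is exactly the elasticity that hard per-container isolation curtails.

```xml
<?xml version="1.0"?>
<!-- fair-scheduler.xml: two hypothetical queues sharing one cluster -->
<allocations>
  <queue name="batch">
    <!-- guaranteed floor; anything above it is borrowed elastically -->
    <minResources>8192 mb,8 vcores</minResources>
    <weight>2.0</weight>
  </queue>
  <queue name="realtime">
    <minResources>4096 mb,4 vcores</minResources>
    <weight>1.0</weight>
  </queue>
</allocations>
```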

In resource scheduling more broadly, the momentum today is with Kubernetes (k8s, pushed hard by Google and said to be descended from Borg) and Mesos (championed by UC Berkeley). Both target general application-level workloads, which leaves YARN's Docker support in an awkward position.

Outlook


Overall, the Hadoop ecosystem has its own resource management layer, and the problem it solves is scheduling many servers to work in parallel as if they were one. Docker, in this respect, is essentially like a VM: it splits one server into many pieces so that it can host more applications. On physical machines outside the cloud, then, the scenarios where Docker and the Hadoop stack meet are very limited. The more promising direction is in the cloud, where Docker could replace VMs to solve elastic scaling.


Welcome to follow the WeChat public account "Big Data and Cloud Computing Technology" for the latest on big data and cloud computing.
