"Hadoop and Big Data Mining" - 1.2 Big Data Platform

Abstract: This chapter is an excerpt from Chapter 1, Section 1.2 of the book "Hadoop and Big Data Mining" (Huazhang), written by Zhang Liangjun, Fan Zhewei, Wenchao, Liu Mingjun, Xu Guojie, Zhou Longjiao, Zhengsheng, et al. More chapters can be viewed on the Yunqi Community "Huazhang Computer" official account. 1.2 Big Data Platform: What is a big data platform? It is generally agreed that a big data platform has two aspects: the hardware platform and the software platform.


1.2 Big Data Platform
What is a big data platform? It is generally agreed that a big data platform has two aspects: the hardware platform and the software platform. Hardware platforms include OpenStack, the Amazon cloud platform, Alibaba Cloud, and so on. What these platforms actually do is virtualization: they virtualize many machines (or a single machine) into a resource pool and then rent the corresponding resources out as services to large numbers of users. The software platform is what we hear about most often, such as Hadoop, MapReduce, and Spark. In a narrow sense it can be understood as the Hadoop ecosystem: it integrates the resources of multiple nodes (which may be virtualized node resources) into a cluster that provides storage and computational analysis services to the outside world.
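Since the text describes the software platform as a cluster that exposes storage (HDFS) and computational analysis (MapReduce) as services, a minimal sketch may help make this concrete. The example below is the classic word-count job written against the standard Hadoop MapReduce Java API; it is not taken from the book, and the class name and the input/output paths passed on the command line are illustrative only.

    // WordCount.java - a minimal word-count job using the standard Hadoop MapReduce API.
    // Input and output directories (on HDFS) are passed as command-line arguments.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emit (word, 1) for every token in each input line.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sum the counts collected for each word across the cluster.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input read from HDFS (storage service)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // results written back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

On a configured cluster, such a job would typically be packaged into a jar and submitted with the hadoop jar command, with the input and output directories living on HDFS; the cluster then distributes the map and reduce tasks across its nodes, which is exactly the "storage plus computational analysis" service described above.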
Big data platforms in the Hadoop ecosystem can be roughly divided into three types: Apache Hadoop (the native open-source Hadoop), Hadoop Distributions, and Big Data Suites (big data development suites). Apache Hadoop is the native version provided on the official website and contains only the basic software. Hadoop Distributions are provided by software vendors and offer considerably more functionality; they come in paid and free editions, and users can choose between them. Big Data Suites are integrated solutions provided by large companies; they offer even more functionality but are also relatively expensive.
Apache Hadoop is open source, so users can read or modify the code directly. A fully distributed deployment involves configuring user permissions, access control, and so on, plus support for a variety of ecosystem software, which makes it fairly complicated, and version incompatibilities can arise. This version is therefore better suited to learning and understanding the underlying details, or to studying detailed Hadoop configuration and tuning.
The Hadoop Distribution versions simplify operations and development tasks for users, offering features such as one-click deployment along with supporting ecosystem software and management and monitoring functions; HDP, CDH, and MapR are platforms of this kind that are widely used in industry. CDH is the most established distribution, with the most deployment cases, and provides powerful deployment, management, and monitoring tools; its developer, Cloudera, has contributed its own Impala project, which can process big data in real time. HDP is the only distribution built on 100% open-source Apache Hadoop; its developer, Hortonworks, has developed many enhancements and submitted them to the core trunk, and it provides a very good, easy-to-use sandbox for beginners. MapR uses a native UNIX file system instead of HDFS (a non-open-source component) for better performance and ease of use, and native UNIX commands can be used in place of Hadoop commands. In addition, MapR differentiates itself from its competitors with high-availability features such as snapshots, mirroring, and stateful failover. When you need a simple learning environment, you can choose this type of version; enterprises can also choose the paid editions, which come with extensive software support.
A Big Data Suite is built on top of an IDE such as Eclipse, and its additional plug-ins greatly simplify the development of big data applications. Users can create, build, and deploy big data services within a familiar development environment, with all the code generated automatically, eliminating the need to write, debug, analyze, and optimize MapReduce code by hand. Big Data Suites provide graphical tools for modeling big data services; all required code is generated automatically, and complex big data jobs can be implemented simply by configuring a few parameters. Enterprise users who need to integrate different data sources, generate code automatically, or schedule big data jobs graphically and automatically can choose a big data suite.
