What foundation do you need to learn Hadoop big data infrastructure?

What is big data? Since the beginning of this century, and especially since 2010, the growth of data has exploded with the development of the Internet, particularly the mobile Internet. It is difficult to estimate how much data is stored in electronic devices around the world. The units used to measure the amount of data in a system keep climbing: from MB (roughly one million bytes), to GB (1024 MB), to TB (1024 GB); today, PB-scale (1024 TB) data systems are already common. Data from social networking sites, scientific computing, securities trading, website logs, and sensor networks keeps growing, and the total amount of data in China has already exceeded the ZB level (1 ZB = 1024 EB, 1 EB = 1024 PB).
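Since each unit is 1024 times the previous one, the scale is easy to compute. As a small illustrative sketch (plain Java, class name hypothetical), the snippet below prints the number of bytes in each unit; BigInteger is used because 1 ZB = 2^70 bytes, which overflows a 64-bit long:

```java
import java.math.BigInteger;

public class DataUnits {
    public static void main(String[] args) {
        // Each unit is 1024x the previous one: KB, MB, GB, TB, PB, EB, ZB.
        // 1 ZB = 2^70 bytes, too large for a 64-bit long, hence BigInteger.
        String[] units = {"KB", "MB", "GB", "TB", "PB", "EB", "ZB"};
        BigInteger size = BigInteger.valueOf(1024); // 1 KB in bytes
        for (String unit : units) {
            System.out.printf("1 %s = %s bytes%n", unit, size);
            size = size.multiply(BigInteger.valueOf(1024));
        }
    }
}
```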

The traditional approach to data processing is to keep upgrading the hardware as the volume of data grows: a more powerful CPU, larger-capacity disks, and so on. The reality, however, is that the volume of data grows far faster than the computing and storage capacity of a single machine.

The "big data" approach instead uses many machines and many nodes to process large volumes of data. Adopting this new approach requires a new kind of big data system, one that handles communication between nodes, coordination, data partitioning, and a series of related problems.

In short, "big data" thinking means processing massive data with multiple machines and nodes while solving the communication, data, and computation coordination problems among them. Its defining property is horizontal scalability: as the volume of data grows, more machines can simply be added, and a big data system can scale to tens of thousands of machines or more.
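To make this scale-out idea concrete, here is a toy sketch (hypothetical plain Java, not any particular framework's API) of the data-coordination piece: records are assigned to nodes by hashing their keys, so adding machines spreads the same data over more workers. Hadoop's shuffle phase routes keys to reducers on the same principle.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashPartitionSketch {
    // Assign a record to one of numNodes workers by hashing its key.
    // floorMod keeps the result non-negative even for negative hash codes.
    static int nodeFor(String key, int numNodes) {
        return Math.floorMod(key.hashCode(), numNodes);
    }

    public static void main(String[] args) {
        List<String> keys = List.of("user42", "user7", "order9001", "order13");
        int numNodes = 3; // scale out by raising this number
        Map<Integer, Integer> load = new HashMap<>();
        for (String key : keys) {
            int node = nodeFor(key, numNodes);
            load.merge(node, 1, Integer::sum);
            System.out.printf("%s -> node %d%n", key, node);
        }
        System.out.println("records per node: " + load);
    }
}
```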


Hadoop originally consisted of two main parts, the distributed file system HDFS and the computing framework MapReduce, and began as a spin-off from the Nutch project. In version 2.0, resource management and task scheduling were split out of MapReduce to form YARN, so that other frameworks can also run on Hadoop just as MapReduce does. Compared with earlier distributed computing frameworks, Hadoop hides many tedious details, such as fault tolerance and load balancing, which makes it much easier to use.
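The canonical first MapReduce program is word count, which ships with Hadoop as an example; a minimal sketch against the org.apache.hadoop.mapreduce API is shown below. The mapper emits a (word, 1) pair per token, the framework shuffles pairs by key across the cluster, and the reducer sums the counts, while fault tolerance and data placement are handled by Hadoop itself.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in its input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for each word; also reused as a combiner.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Submitted with `hadoop jar wordcount.jar WordCount /input /output`, the same jar runs unchanged whether the cluster has three nodes or three thousand, which is exactly the horizontal scaling described above.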

Hadoop also has strong horizontal scalability: new machines can easily join the cluster and take part in computation. With the support of the open source community, Hadoop has continued to develop and improve, and has attracted many excellent related projects, including the non-relational database HBase, the data warehouse Hive, the data transfer tool Sqoop, the machine learning library Mahout, the distributed coordination service ZooKeeper, and the management tool Ambari, forming a fairly complete ecosystem and the de facto standard for distributed computing.
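As a taste of one ecosystem component, here is a minimal sketch of writing and reading a single cell with the HBase Java client. It assumes a reachable HBase cluster configured via an hbase-site.xml on the classpath, and an already-created table named demo with column family cf (both names are illustrative, not part of any standard setup).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellDemo {
    public static void main(String[] args) throws Exception {
        // Reads cluster settings (e.g. the ZooKeeper quorum) from hbase-site.xml.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("demo"))) {
            // Write one cell: row "row1", family "cf", qualifier "greeting".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("greeting"), Bytes.toBytes("hello"));
            table.put(put);

            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("greeting"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```

Note that the client only needs ZooKeeper's address to locate the cluster, which is one reason ZooKeeper appears throughout the Hadoop ecosystem.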

Dakuai's big data general computing platform (DKH) integrates all the components of its development framework under a single unified version number. If the Dakuai development framework is instead deployed on an open source big data stack, the platform components that each module requires are as follows:

Data source and SQL engine: DK.Hadoop, Spark, Hive, Sqoop, Flume, Kafka

Data collection: DK.Hadoop

Data processing module: DK.Hadoop, Spark, Storm, Hive

Machine learning and AI: DK.Hadoop, Spark

NLP module: upload the server-side JAR package; supported directly

Search engine module: not released independently

The Dakuai Big Data Platform (DKH) is a one-stop, search-engine-grade, general-purpose big data computing platform designed by Dakuai to bridge the gap between the big data ecosystem and traditional non-big-data companies. With DKH, traditional companies can easily cross the technical divide of big data and achieve search-engine-level big data platform performance.

- DKH effectively integrates all the components of the entire Hadoop ecosystem, deeply optimized and recompiled into a complete, higher-performance general-purpose big data computing platform, achieving organic coordination among the components. As a result, compared with open source big data platforms, DKH delivers up to a 5x improvement in computing performance.

- Through Dakuai's proprietary middleware technology, DKH simplifies complex big data cluster configuration down to three node types (master node, management node, computing node), which greatly simplifies cluster management, operation, and maintenance, and enhances the cluster's availability, maintainability, and stability.


- Although highly integrated, DKH retains all the advantages of the open source systems and is 100% compatible with them. Big data applications developed on open source platforms can run on DKH without any changes, with performance improved by up to 5x.

- DKH also integrates Dakuai's big data integrated development framework (FreeRCH). The FreeRCH framework provides more than 20 categories of functionality commonly used in big data, search, natural language processing, and artificial intelligence development, along with more than 10 development methods, improving development efficiency by more than 10 times.

- The SQL version of DKH also integrates distributed MySQL, allowing traditional information systems to seamlessly make the leap to big data and distributed architecture.


DKH standard platform technology framework (architecture diagram)
