The road to learning Hadoop and Spark

Hadoop

Hadoop is a distributed system infrastructure developed by the Apache Foundation.
Users can develop distributed applications without knowing the underlying details of distribution, taking full advantage of a cluster's power for high-speed computing and storage.
Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and designed to be deployed on low-cost hardware; it provides high-throughput access to application data and is well suited to applications with large data sets. HDFS relaxes some POSIX requirements so that file system data can be accessed as a stream.
The core of the Hadoop framework is HDFS and MapReduce: HDFS provides storage for massive amounts of data, while MapReduce provides computation over that data.
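The MapReduce model mentioned above can be sketched in plain Python, without a Hadoop cluster. This is only an illustration of the map, shuffle, and reduce phases as applied to the classic word-count problem; it is not Hadoop API code (real Hadoop jobs are typically written in Java against the Hadoop MapReduce API):

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input record.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key (here: sum the counts).
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big cluster", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In a real cluster, the map and reduce functions run in parallel on many nodes, and the shuffle moves data between them over the network; the structure of the computation, however, is the same.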

Hadoop's main advantages:

Hadoop is a framework that allows users to easily architect and use a distributed computing platform. Users can easily develop and run applications that handle massive amounts of data on Hadoop. Its main advantages are:
High reliability. Hadoop's bit-level storage and data-processing capability has earned people's trust.
High scalability. Hadoop distributes data and computing tasks across clusters of available computers, and these clusters can easily scale to thousands of nodes.
Efficiency. Hadoop can dynamically move data between nodes and keeps each node dynamically balanced, so processing is very fast.
High fault tolerance. Hadoop automatically keeps multiple copies of the data and automatically reassigns failed tasks.
Low cost. Compared with all-in-one machines, commercial data warehouses, and data marts such as QlikView and Yonghong Z-Suite, Hadoop is open source, so the software cost of a project is greatly reduced.

Spark

Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark is a general parallel framework in the style of Hadoop MapReduce and shares the advantages of Hadoop MapReduce; unlike MapReduce, however, a job's intermediate output can be kept in memory, eliminating the need to read and write HDFS. Spark is therefore better suited to MapReduce-style algorithms that require iteration, such as those used in data mining and machine learning.
Spark is an open-source cluster computing environment similar to Hadoop, but there are some differences between the two, and these useful differences make Spark superior for certain workloads. In other words, Spark enables in-memory distributed data sets, and in addition to providing interactive queries, it can also optimize iterative workloads.
Spark is implemented in the Scala language and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, so Scala can operate on distributed data sets as easily as on local collection objects.
Although Spark was created to support iterative jobs on distributed data sets, it is in fact a complement to Hadoop and can run in parallel on the Hadoop file system. This behavior can be supported through a third-party cluster framework called Mesos.
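The point that distributed data sets can be operated on "as easily as local collections" can be illustrated in plain Python: Spark's RDD API exposes the same functional map/filter/reduce chain that works on an ordinary list. The snippet below runs only on a local list and is an analogy for the shape of that API, not Spark code:

```python
from functools import reduce

nums = list(range(1, 11))

# The same map -> filter -> reduce pipeline that Spark exposes on
# distributed data sets, applied here to an ordinary local list.
squares = map(lambda n: n * n, nums)           # transform each element
evens = filter(lambda s: s % 2 == 0, squares)  # keep only even squares
total = reduce(lambda a, b: a + b, evens)      # aggregate to one value
print(total)  # 220
```

In Spark, the collection would be partitioned across the cluster and each transformation would run in parallel, but the code a developer writes keeps this collection-like shape.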
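The benefit of keeping intermediate results in memory, described above, can be sketched with a toy comparison in plain Python. The `DiskDataset` and `CachedDataset` classes below are hypothetical illustrations of the difference between re-reading input on every pass (MapReduce-style) and loading it once into memory (Spark-style); they are not Spark API code:

```python
class DiskDataset:
    """Simulates a dataset that is re-read from storage on every pass."""
    def __init__(self, records):
        self._records = records
        self.reads = 0  # counts simulated trips to storage

    def records(self):
        self.reads += 1  # every pass pays the I/O cost again
        return list(self._records)

class CachedDataset(DiskDataset):
    """Simulates a dataset loaded once and then kept in memory."""
    def __init__(self, records):
        super().__init__(records)
        self._cache = None

    def records(self):
        if self._cache is None:
            self.reads += 1  # only the first pass touches storage
            self._cache = list(self._records)
        return self._cache

def iterate(dataset, passes):
    # A stand-in for an iterative algorithm that scans the data each pass.
    total = 0
    for _ in range(passes):
        total += sum(dataset.records())
    return total

disk, cached = DiskDataset([1, 2, 3]), CachedDataset([1, 2, 3])
print(iterate(disk, 10), disk.reads)      # 60 10
print(iterate(cached, 10), cached.reads)  # 60 1
```

Both versions compute the same answer, but the cached dataset touches storage once instead of once per iteration, which is the essence of why Spark suits iterative algorithms better than plain MapReduce.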

Spark has three main features:

First, its high-level APIs strip away concerns about the cluster itself, so Spark application developers can focus on the computation the application performs.
Second, Spark is fast, supporting interactive computation and complex algorithms.
Finally, Spark is a general-purpose engine that can be used to accomplish a variety of operations, including SQL queries, text processing, and machine learning; before Spark appeared, we generally needed to learn a separate engine for each of these needs.


Origin www.cnblogs.com/lph970417/p/11423691.html