A collection of Spark knowledge for big data

Spark's iterative computation is memory-based, built around applying multiple operations to a particular data set. The more often operations have to be repeated and the more data has to be read each time, the greater the benefit; for workloads that are data-intensive but involve little repeated computation, the benefit is relatively small. This is an important factor when deciding whether to use Spark in a big data architecture.

1. What is the core of Spark?

RDD is Spark's basic abstraction. It abstracts distributed memory so that operations on distributed data sets can be written as if they were operations on local collections. The RDD is the core concept of Spark: it represents a partitioned, immutable collection of data that can be operated on in parallel, and different data formats correspond to different RDD implementations.

RDDs must be serializable. RDDs can be cached in memory: the result of each operation on an RDD can be kept in memory, so the next operation reads its input directly from memory, eliminating the heavy disk IO of MapReduce. For common iterative workloads such as machine learning algorithms and interactive data mining, the efficiency gain is large.
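As a minimal sketch of this idea (assuming a SparkContext bound to sc, as in spark-shell; the input path and iteration count are illustrative), caching an RDD lets repeated passes read from memory instead of recomputing from the source:

// A minimal sketch of RDD caching for iterative work. Assumes a SparkContext
// bound to `sc` (as in spark-shell); the path and loop count are illustrative.
val points = sc.textFile("hdfs:///data/points.txt")
  .map(line => line.split(",").map(_.toDouble))
  .cache()                                   // keep the parsed records in memory

// Every pass below reads the cached partitions instead of re-reading
// and re-parsing the input file.
for (i <- 1 to 10) {
  val total = points.map(_.sum).reduce(_ + _)
  println(s"pass $i: total = $total")
}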

2. What scenarios is Spark suitable for?

Because of the nature of RDDs, Spark is not well suited to applications that make fine-grained, asynchronous updates to state, such as the storage layer of a web service or an incremental web crawler and indexer; in other words, applications whose model is incremental modification are a poor fit. Apart from that, Spark is fairly general-purpose and applicable to a wide range of workloads.

3. What programming languages does Spark support?

Spark exposes operations on RDDs in a language-integrated manner, similar to DryadLINQ and FlumeJava: each data set is represented as an RDD object, and operations on a data set are expressed as method calls on the corresponding RDD object. The main programming languages Spark supports are Scala, Java, and Python.
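A small sketch of this language-integrated style in Scala (assuming a SparkContext bound to sc; the path is illustrative): the data set is an RDD object, transformations return new RDDs, and an action triggers the computation.

// The data set is an RDD object; operations on it are method calls.
val lines  = sc.textFile("hdfs:///data/access.log")    // RDD[String]
val errors = lines.filter(_.contains("ERROR"))         // transformation: returns a new RDD
val count  = errors.count()                            // action: runs the job and returns a value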

Scala

Spark itself is developed in Scala, and Scala is its default programming language. Writing a Spark program is much simpler than writing a Hadoop MapReduce program, and Spark provides spark-shell, in which you can test programs interactively.
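For example, here is a short word-count session of the kind you might try directly in spark-shell (the input file is illustrative; sc is the SparkContext the shell creates for you):

scala> val words  = sc.textFile("README.md").flatMap(_.split("\\s+"))
scala> val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.take(5).foreach(println)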

Java

Spark also supports programming in Java, although there is no tool for Java as convenient as spark-shell. Otherwise, programming in Java is much the same as in Scala: both languages run on the JVM, Scala interoperates with Java, and the Java programming interface is in fact a wrapper around the Scala one.

Python

Spark now also provides a Python programming interface. It uses Py4J to implement interoperability between Python and Java, which makes it possible to write Spark programs in Python. Spark also offers pyspark, a Python shell for Spark, in which you can write Spark programs interactively in Python.

Origin: blog.csdn.net/kangshufu/article/details/92427607