Spark study notes [basic concepts]

Foreword

Although I am a back-end developer, my work will require big data knowledge later on. I spent some time studying Hadoop-related topics earlier, and I will keep digging deeper into the big data stack as time allows, summarizing as I learn. I hope I don't give up halfway.

Spark basics

What is Spark

Apache Spark is an open source distributed computing framework. Spark is designed to handle large-scale data processing tasks, and can perform fast batch processing, stream processing, and machine learning operations on large-scale data sets.
Spark official website: http://spark.incubator.apache.org/

The differences between Spark and Hadoop

  • Data processing model: Hadoop (MapReduce) uses a batch processing model that works through the entire data set in one pass, while Spark supports both batch processing and stream processing and can handle data streams in near real time.
  • Processing speed: Spark is generally faster than Hadoop. Spark gains speed from in-memory computing and in-memory data sharing, while Hadoop is bound by disk read/write speed.
  • Programming language: Hadoop is mostly programmed in Java, while Spark supports multiple languages such as Java, Scala, Python, and R.
  • Memory management: Spark uses a more efficient memory management model and can cache data in memory, while Hadoop relies on frequent disk reads and writes.
  • Storage support: Spark can work with various data stores such as Hive, HBase, and Cassandra, while Hadoop mainly uses HDFS.
  • Ecosystem: Hadoop has a huge ecosystem, including tools such as Hive, Pig, and MapReduce, while Spark's ecosystem is smaller but growing steadily.

Spark is more flexible and efficient than Hadoop in terms of speed, processing model, and programming language. At the same time, Spark is also compatible with Hadoop and can be used with Hadoop components to provide more choices and flexibility for big data processing.

Spark core modules

The main components of Spark include Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX. Each component provides a set of APIs that make it easier for developers to carry out data processing and analysis tasks.
The core modules of Spark include:

  • Spark Core: The basic module of Spark, which provides functions such as distributed task scheduling, memory management and fault tolerance. Spark Core provides a set of APIs that support the Scala, Java, and Python programming languages.
  • Spark SQL: Spark's SQL query module, which supports relational data processing, can query data through SQL statements, and can also interact with data warehouses such as Hive.
  • Spark Streaming: Spark's stream processing module supports real-time data stream processing. Spark Streaming provides a set of APIs that can convert real-time data streams into batch data for processing.
  • MLlib: Spark's machine learning module, which provides a set of machine learning algorithm libraries that support tasks such as classification, clustering, regression, and collaborative filtering.
  • GraphX: Spark's graph processing module that supports graph data processing and analysis.

In addition to these core modules, Spark provides many other modules and extensions, such as:

  • SparkR: Spark's R language interface, which supports data processing and analysis using Spark in R.
  • PySpark: Spark's Python language interface, which supports data processing and analysis using Spark in Python.
  • Spark Streaming Kafka: Supports integration with Apache Kafka in Spark Streaming.
  • Spark Streaming Flume: Supports integration with Apache Flume in Spark Streaming.
These modules and extensions can meet different data processing and analysis needs.
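As a quick illustration of how these modules fit together, here is a minimal sketch (assuming Spark 2.x or later and a local master; the view name and column names are made up for the example) that uses Spark Core through the RDD API and Spark SQL through a SQL query:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("modules-demo").master("local[*]").getOrCreate()

// Spark Core: the RDD API, exposed through SparkContext
val nums = spark.sparkContext.parallelize(1 to 10)
println(nums.sum())   // 55.0

// Spark SQL: register a DataFrame as a temporary view and query it with SQL
import spark.implicits._
val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
df.createOrReplaceTempView("t")
spark.sql("SELECT key, value FROM t WHERE value > 1").show()

spark.stop()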

Spark running mode

  1. Local mode: Run the Spark application on the local machine for development, testing and debugging.
  2. Standalone mode: uses Spark's own cluster manager to run Spark applications and supports deploying applications onto a cluster of multiple nodes. If high availability is needed in Standalone mode, Standalone-HA mode can be used.
  3. Standalone-HA mode: High availability is achieved in Standalone mode. By using ZooKeeper to coordinate the election and failure recovery of the master node, it is ensured that the Spark application can automatically switch to the standby node when the master node fails.
  4. On YARN mode: Run Spark applications on an Apache Hadoop YARN cluster. YARN is a Hadoop cluster management system that supports the operation of various distributed applications, including Spark.
  5. On Mesos mode: Run Spark applications on an Apache Mesos cluster. Mesos is a general-purpose cluster management system that supports the operation of various distributed applications, including Spark.
  6. On Cloud mode: Deploy Spark applications to run on cloud platforms, such as Amazon EMR, Google Cloud Dataproc, etc.
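Which of these modes is used is determined mainly by the master URL given to the application (on the command line through spark-submit, or programmatically). A hedged sketch, with placeholder host names:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("run-mode-demo")

conf.setMaster("local[*]")                     // Local mode: everything in one JVM, using all cores
// conf.setMaster("spark://master-host:7077")  // Standalone mode: Spark's own cluster manager
// conf.setMaster("yarn")                      // On YARN mode: cluster resolved from the Hadoop configuration

val sc = new SparkContext(conf)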

Later on, we will walk through several of these common deployment modes in detail.

Spark running architecture

Runtime framework

The Spark running architecture mainly consists of the following four components:

  • Driver: The driver program is a process running on the master node, responsible for controlling the execution process of the entire application, including scheduling tasks, allocating resources, collecting and summarizing task execution results, etc. The driver program will request resources from the cluster manager and distribute tasks to the Executors in the cluster for execution.
  • Executor: Executor is a process running on a worker node, responsible for executing the tasks assigned to them by the driver program, and returning the task execution results to the driver program. Each Executor has its own JVM process, which can cache data during task execution to speed up task execution.
  • Cluster Manager: The cluster manager is a component used to manage cluster resources. Spark supports a variety of cluster managers, including Standalone, Apache Mesos, Hadoop YARN, etc. The cluster manager is responsible for receiving driver program requests and assigning appropriate Executors and resources to the driver program.
  • Spark Application: A Spark application is a distributed application program consisting of a driver program and a set of tasks. The driver program is responsible for controlling the execution process of the entire application program, and the tasks are executed on the Executor. Spark applications can implement various data processing and analysis tasks by writing components such as Spark SQL, Spark Streaming, and Spark MLlib.

In Spark's operating architecture, the driver program and Executor pass data and tasks through network communication, and the cluster manager is responsible for scheduling resources and managing the cluster. Since Spark uses memory computing, it can greatly improve the speed and efficiency of data processing and analysis.

Executor and Core

In Spark, Executors are processes running on the worker nodes. They are responsible for executing the tasks assigned to them by the driver program and returning the task execution results to the driver program. Each Executor has its own JVM process, which can cache data during task execution to speed up task execution.

The Core (core) is the computing resource unit in the Executor, and each Executor is composed of multiple Cores. The number of Cores is usually determined by the hardware configuration. For example, if a node has 16 CPU cores, the Executor can be configured to use 8 cores.

In Spark, the number of Executors and the number of Cores used by each Executor can be set through configuration files. By increasing the number of Executors and the number of Cores used by each Executor, the parallelism and running speed of Spark applications can be improved. However, if the number of Executors is too large or the number of Cores used by each Executor is too large, resources may be wasted or insufficient, thus affecting the performance and stability of Spark applications. Therefore, when setting the number of Executors and Cores, it needs to be adjusted according to specific application scenarios and hardware configurations.

The application-related startup parameters are as follows:

  • --num-executors: configures the number of Executors
  • --executor-memory: configures the amount of memory for each Executor
  • --executor-cores: configures the number of virtual CPU cores for each Executor
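For illustration, the same resources can also be set programmatically on SparkConf; the configuration keys below are the counterparts of the spark-submit options above (the values are placeholders, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("executor-sizing-demo")
  .setMaster("yarn")
  .set("spark.executor.instances", "4")   // --num-executors
  .set("spark.executor.memory", "4g")     // --executor-memory
  .set("spark.executor.cores", "2")       // --executor-cores

val sc = new SparkContext(conf)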

Parallelism

Parallelism refers to the ability to process multiple tasks or data at the same time. In computer science, it is often used to describe how many tasks or data an application or system can process simultaneously. In Spark, parallelism refers to the ability to process multiple tasks or data in a Spark application at the same time, usually represented by the number of Executors and the number of Cores used by each Executor.

In Spark applications, increasing the degree of parallelism can improve the speed and efficiency of task execution, thereby speeding up data processing and analysis. In practical applications, there are mainly the following methods to improve parallelism:

  • Increase the number of Executors: By increasing the number of Executors, the parallelism of the Spark application can be increased, thereby speeding up task execution. It should be noted that increasing the number of Executors requires reasonable settings based on hardware configuration and application resource consumption.
  • Increase the number of Cores used by each Executor: By increasing the number of Cores used by each Executor, the computing power of each Executor can be increased, thereby speeding up task execution. It should be noted that increasing the number of Cores also requires reasonable settings based on hardware configuration and application resource consumption.
  • Use data partitioning: Data partitioning is a technology that divides data into multiple parts according to certain rules for parallel processing. By using data partitions, data can be distributed to different Executors for processing, thereby improving parallelism and execution efficiency.
  • Using Parallel Algorithms: A parallel algorithm is an algorithm that can be executed on multiple processors at the same time. By using parallel algorithms, tasks can be decomposed into multiple parts for parallel processing, thereby improving parallelism and execution efficiency.
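Building on the points above, here is a small sketch of controlling parallelism at the RDD level (assuming an existing SparkContext named sc; the partition counts are arbitrary examples):

val data = sc.parallelize(1 to 1000, 8)   // explicitly create 8 partitions
println(data.getNumPartitions)            // 8

val wider    = data.repartition(16)       // increase the number of partitions (incurs a shuffle)
val narrower = wider.coalesce(4)          // reduce the number of partitions (avoids a full shuffle)

// A default for operators that take no explicit partition count can also be set:
// new SparkConf().set("spark.default.parallelism", "16")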

Directed Acyclic Graph (DAG)

A Directed Acyclic Graph (DAG for short) is a graph structure made up of nodes connected by directed edges, in which no cycles exist (that is, following the directed edges can never lead back to the same node). It is usually used to describe dependencies in a computation or data processing workflow.

In Spark, the DAG is a very important concept that describes the data processing flow of an application. The DAG consists of a series of RDDs (Resilient Distributed Datasets) and the transformation operations between them. Each RDD represents a distributed data set, and transformations derive new RDDs from existing ones. Every transformation in a Spark application produces a new RDD and adds it to the DAG: each node in the DAG represents an RDD, and each directed edge represents a transformation.

In Spark applications, the generation and optimization of DAG is the responsibility of Spark's task scheduler. When a Spark application is submitted to the cluster to run, Spark will convert each operation in the application into a set of tasks, and then organize the tasks into a DAG according to dependencies. Then, Spark will optimize the DAG, such as merging adjacent operations, removing useless operations, etc., so as to improve the execution efficiency and parallelism of tasks. Finally, Spark will split the DAG into multiple stages, and each stage can be executed in parallel, thereby improving the efficiency of task execution.

In simple terms, a DAG is a graph structure that describes the dependencies in a data processing flow. In Spark, the DAG describes the transformation relationships and execution order between RDDs; by analyzing the DAG, Spark can optimize the execution plan and improve the parallelism and execution efficiency of tasks.
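A small sketch of inspecting the DAG that Spark builds from a chain of transformations (assuming an existing SparkContext sc; the HDFS path is reused from the earlier example). toDebugString prints the RDD lineage, and the shuffle introduced by reduceByKey marks a stage boundary:

val words  = sc.textFile("hdfs://localhost:9000/data.txt").flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

println(counts.toDebugString)   // shows the chain of RDDs and their dependencies
counts.collect()                // only now is the DAG actually scheduled and executed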

How to submit Spark applications

client

In client mode, the Driver program runs in the client process that submits the Spark application, while the Executor program runs on the computing nodes in the cluster. This method of submission is often used in debugging and development environments, because it makes it easier to view and debug the running status and results of the application.

cluster

In cluster mode, the Driver program runs on a node in the cluster, and the Executor program also runs on other nodes in the cluster. This submission method is usually used in a production environment because it can better utilize cluster resources and improve task parallelism and execution efficiency.

In practice (at least domestically), Spark applications are most often deployed on YARN, so the submission process in these notes is based on the YARN environment.

Spark core programming

Three major data structures

  • RDD (Resilient Distributed Datasets): RDD is one of the most basic distributed data structures in Spark. It is an immutable collection of distributed objects. Each partition in the RDD stores a part of the data and is distributed across multiple nodes in the cluster. RDDs provide a number of transformation and action operations that can transform and compute on them.

  • Accumulator: An accumulator is a special variable that can be updated in parallel in a distributed environment. Built-in accumulators only support add operations: tasks running on the Executors can only add to an accumulator, and only the Driver can read its value. Accumulators are mainly used for aggregation operations such as counting and summation.

  • Broadcast Variable: A broadcast variable is a read-only variable that can be shared across the cluster. In the process of distributed computing, if some variables need to be shared among multiple nodes, these variables can be transmitted by means of broadcast variables. The broadcast variable will only be sent once, and then cached on the Executor side for subsequent use, which can reduce network transmission and memory usage.
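A minimal sketch showing the three structures side by side (assuming an existing SparkContext sc and the Spark 2.x+ accumulator API):

val rdd    = sc.parallelize(Seq(1, 2, 3, 4, 5))            // RDD
val hits   = sc.longAccumulator("hits")                    // accumulator
val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))     // broadcast variable

rdd.foreach { x =>
  // tasks read the broadcast value locally and may only add to the accumulator
  if (lookup.value.contains(x)) hits.add(1)
}

println(hits.value)   // only the Driver reads the accumulated result: 2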

RDD

What is RDD

RDD (Resilient Distributed Datasets) is one of the most basic distributed data structures in Spark. It is an immutable collection of distributed objects. Each partition in the RDD stores a part of the data and is distributed across multiple nodes in the cluster. RDDs provide a number of transformation and action operations that can transform and compute on them.

The characteristics of RDD are as follows:

  • Distributed: The data in RDD is distributed on multiple nodes of the cluster, enabling parallel computing.
  • Immutability: RDDs are immutable and cannot be modified once created. If you need to transform the RDD, you need to generate a new RDD.
  • Fault tolerance: RDD can achieve fault tolerance through data division and backup. If a node fails, the data can be recalculated from the backup.
  • Lazy evaluation: Transformations in Spark are evaluated lazily; the data is actually computed only when an action is executed.
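A small illustration of lazy evaluation (assuming an existing SparkContext sc): transformations only record lineage, and nothing is computed until an action is called.

val numbers = sc.parallelize(1 to 10)
val doubled = numbers.map(_ * 2)           // transformation: no job runs yet
val evens   = doubled.filter(_ % 4 == 0)   // still nothing is computed

println(evens.count())                     // action: the whole chain is executed now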

Principle of execution

In the Yarn environment, Spark's RDD works as follows:

  • First, the Driver program requests resources from the ResourceManager and starts the ApplicationMaster.
  • The ApplicationMaster applies to the ResourceManager for Containers and starts Executors inside them.
  • The Executors register themselves with the Driver program and wait for tasks.
  • The Driver program organizes the operations on RDDs (transformations and actions) into a directed acyclic graph (DAG) according to their dependencies and splits it into tasks.
  • When the Driver program executes an action, the corresponding tasks are distributed to the Executors for execution; each task computes one partition of the RDD, and partitions can be cached in memory. A task that depends on the output of other tasks runs only after those tasks have completed.
  • After execution completes, the Executors return their results to the Driver program.

RDD API

RDD creation

  1. Create an RDD from an in-memory collection (Parallelized Collections):
val rdd = sc.parallelize(1 to 100)
  2. Create an RDD by reading data from an external storage system (such as HDFS or the local file system):
    Use the SparkContext.textFile() method to create an RDD of a text file from one or more files. For example, the following code can be used to read a file from HDFS to create an RDD:
val rdd = sc.textFile("hdfs://localhost:9000/data.txt")
  3. Create a new RDD by transforming an existing RDD:
    A new RDD can be created by performing a series of transformation operations on an existing RDD. For example, the following code can be used to create a new RDD where each number is doubled:
val rdd1 = sc.parallelize(1 to 100)
val rdd2 = rdd1.map(_ * 2)
  4. Create an RDD from a Hadoop input format:
    Use the SparkContext.newAPIHadoopFile() method to create an RDD from an existing Hadoop input format (such as TextInputFormat from the new org.apache.hadoop.mapreduce API). For example, the following code creates an RDD from a file using a Hadoop input format:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration()
val file = sc.newAPIHadoopFile("hdfs://localhost:9000/data.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
val rdd = file.map(pair => pair._2.toString)

RDD conversion operator

The common transformation (conversion) operators are listed below; a short example follows the list.

  • map(func): Returns a new RDD formed by passing each input element through the function func.
  • filter(func): Returns a new RDD consisting of the input elements for which func returns true.
  • flatMap(func): Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element).
  • mapPartitions(func): Similar to map, but operates on each partition of the RDD independently; when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U].
  • mapPartitionsWithIndex(func): Similar to mapPartitions, but func also receives an integer parameter representing the index of the partition; when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U].
  • sample(withReplacement, fraction, seed): Samples the data at the ratio specified by fraction, with or without replacement, using seed to initialize the random number generator.
  • union(otherDataset): Returns a new RDD that is the union of the source RDD and the argument RDD.
  • intersection(otherDataset): Returns a new RDD that is the intersection of the source RDD and the argument RDD.
  • distinct([numTasks]): Returns a new RDD with the duplicate elements of the source RDD removed.
  • groupByKey([numTasks]): Called on an RDD of (K, V) pairs; returns an RDD of (K, Iterable[V]) pairs.
  • reduceByKey(func, [numTasks]): Called on an RDD of (K, V) pairs; returns an RDD of (K, V) pairs where the values for each key are aggregated using the given reduce function. Like groupByKey, the number of reduce tasks can be set through the optional second argument.
  • aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): Aggregates the values of each key in a pair RDD, using a neutral initial value; unlike the value type of the RDD, the return type of aggregateByKey does not need to match it.
  • sortByKey([ascending], [numTasks]): Called on an RDD of (K, V) pairs where K implements the Ordered interface; returns an RDD of (K, V) pairs sorted by key.
  • sortBy(func, [ascending], [numTasks]): Similar to sortByKey, but more flexible.
  • join(otherDataset, [numTasks]): Called on RDDs of type (K, V) and (K, W); returns an RDD of (K, (V, W)) pairs with all pairs of elements for each key.
  • cogroup(otherDataset, [numTasks]): Called on RDDs of type (K, V) and (K, W); returns an RDD of type (K, (Iterable[V], Iterable[W])).
  • cartesian(otherDataset): Cartesian product.
  • pipe(command, [envVars]): Pipes each partition of the RDD through a shell command.
  • coalesce(numPartitions): Reduces the number of partitions in the RDD to the specified value, useful after filtering down a large data set.
  • repartition(numPartitions): Repartitions the RDD.
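A short word-count style example that exercises a few of the transformation operators above (assuming an existing SparkContext sc); note that none of these lines triggers a job by itself:

val lines  = sc.parallelize(Seq("hello spark", "hello world"))
val words  = lines.flatMap(_.split(" "))              // flatMap: one line -> many words
val pairs  = words.map(word => (word, 1))             // map: word -> (word, 1)
val counts = pairs.reduceByKey(_ + _)                 // reduceByKey: sum the counts per key
val sorted = counts.sortBy(_._2, ascending = false)   // sortBy: order by count, descending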

Action operators

The common action operators are listed below; a short example follows the list.

  • reduce(func): Aggregates all the elements of the RDD using the function func, which must be commutative and associative so that it can be computed in parallel.
  • collect(): Returns all elements of the dataset as an array in the driver program.
  • count(): Returns the number of elements in the RDD.
  • first(): Returns the first element of the RDD (similar to take(1)).
  • take(n): Returns an array with the first n elements of the dataset.
  • takeSample(withReplacement, num, [seed]): Returns an array of num elements randomly sampled from the dataset, with or without replacement, optionally using seed to initialize the random number generator.
  • takeOrdered(n, [ordering]): Returns the first n elements of the RDD using either their natural order or a custom ordering.
  • saveAsTextFile(path): Saves the elements of the dataset as a text file to HDFS or another supported file system; Spark calls toString on each element to convert it to a line of text in the file.
  • saveAsSequenceFile(path): Saves the elements of the dataset as a Hadoop SequenceFile in the given directory, on HDFS or another file system supported by Hadoop.
  • saveAsObjectFile(path): Saves the elements of the dataset to the given directory using Java serialization.
  • countByKey(): For RDDs of type (K, V), returns a (K, Int) map with the number of elements for each key.
  • foreach(func): Runs the function func on each element of the dataset, typically for side effects such as updating an accumulator.
  • foreachPartition(func): Runs the function func on each partition of the dataset.
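A short example of the action operators above (assuming an existing SparkContext sc); each of these calls triggers a job:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

println(pairs.count())             // 3
println(pairs.first())             // (a,1)
println(pairs.take(2).toList)      // List((a,1), (b,2))
println(pairs.countByKey())        // Map(a -> 2, b -> 1)
pairs.collect().foreach(println)   // bring all elements back to the Driver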

Statistical operations

The statistical operators (available on numeric RDDs) are listed below; a short example follows the list.

  • count: number of elements
  • mean: average value
  • sum: sum of the elements
  • max: maximum value
  • min: minimum value
  • variance: variance of the elements
  • sampleVariance: variance computed from a sample
  • stdev: standard deviation, a measure of how spread out the data is
  • sampleStdev: sample standard deviation
  • stats: returns all of the above statistics at once
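A sketch of the statistical operators, which are available on RDDs of Double (or other numeric RDDs via implicit conversion); sc is assumed to be an existing SparkContext:

val values = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0))

println(values.count())     // 5
println(values.mean())      // 3.0
println(values.sum())       // 15.0
println(values.max())       // 5.0
println(values.min())       // 1.0
println(values.variance())  // 2.0
println(values.stdev())     // ~1.414
println(values.stats())     // all of the above in one StatCounter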

RDD serialization

RDD serialization is an important concept in Spark. It refers to converting the data objects in an RDD into a byte stream so that they can be transferred over the network between nodes or stored on disk. Serialization is needed because, in distributed computation, an RDD has to be transferred and stored across multiple nodes whose operating systems and hardware may differ, so the RDD's data objects must be serialized before they can move between nodes.

Spark supports two serialization mechanisms for RDD data: Java serialization and Kryo serialization. Java serialization is the JVM's built-in mechanism; it is general-purpose but relatively inefficient. Kryo is a high-performance serialization library that serializes objects into much smaller byte arrays, improving the efficiency of network transfer and disk storage. When Kryo is used, the classes to be serialized should be registered first so that Kryo can serialize and deserialize them correctly.

By default, Spark uses Java serialization, but Kryo can be enabled by setting the "spark.serializer" property on SparkConf, for example:

val conf = new SparkConf()
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
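As noted above, classes should be registered when Kryo is used; a small sketch (MyRecord is a hypothetical application class):

case class MyRecord(id: Long, name: String)   // hypothetical application class

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))
val sc = new SparkContext(conf)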

RDD dependencies

In Spark, an RDD's dependencies describe its relationship to other RDDs, including the transformations between them and the type of dependency. Based on the dependency type, RDD dependencies fall into two categories: narrow dependencies and wide dependencies.

  • Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD. For example, transformations such as map and filter produce narrow dependencies.

  • Wide dependency (shuffle dependency): a partition of the parent RDD may be used by multiple partitions of the child RDD, so computing a child partition requires data from many (possibly all) parent partitions and therefore a shuffle. For example, transformations such as reduceByKey and groupByKey produce wide dependencies.

In Spark, RDD dependencies are organized as a directed acyclic graph (DAG). Each RDD has a set of parent RDDs and a set of child RDDs, and every parent-child pair is connected by a directed edge that represents the dependency. Within a DAG, every RDD is derived from the original data set through a series of transformations, and the RDD finally obtained is called the output RDD.

DAG construction in Spark is lazy, meaning the DAG is only evaluated when an action actually needs to run. This laziness avoids unnecessary computation and improves efficiency; and because the DAG is acyclic, it can also be optimized to reduce computation overhead and improve performance.
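A small sketch of inspecting dependency types on concrete RDDs (assuming an existing SparkContext sc):

val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val mapped  = pairs.mapValues(_ * 10)    // narrow dependency: no shuffle
val reduced = pairs.reduceByKey(_ + _)   // wide dependency: requires a shuffle

println(mapped.dependencies)    // e.g. List(org.apache.spark.OneToOneDependency@...)
println(reduced.dependencies)   // e.g. List(org.apache.spark.ShuffleDependency@...)
println(reduced.toDebugString)  // the lineage also shows the stage boundary at the shuffle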

RDD persistence

In Spark, RDD persistence means caching an RDD's data in memory or on disk so that it can be reused later. It is done with the RDD's persist() or cache() method; cache() is simply persist() with the default storage level (MEMORY_ONLY), so both cache the RDD's data.

Both methods are very easy to use: just call persist() or cache() on the RDD and, for persist(), specify the storage level. For example:

import org.apache.spark.storage.StorageLevel
val rdd = sc.parallelize(Seq(1, 2, 3))
rdd.persist(StorageLevel.MEMORY_ONLY)

Or:

val rdd = sc.parallelize(Seq(1, 2, 3))
rdd.cache()

Here, StorageLevel.MEMORY_ONLY specifies an in-memory cache, meaning the RDD's data is stored in memory. Besides MEMORY_ONLY, Spark supports several other storage levels, such as MEMORY_ONLY_SER, MEMORY_AND_DISK, and MEMORY_AND_DISK_SER, which can be chosen according to actual needs.

Note that persisting an RDD consumes memory or disk space, so the storage level should be chosen based on the actual situation. Cached RDD data can be released with the unpersist() method, for example:

rdd.unpersist()

After persist() or cache() is called, Spark decides, based on the storage level and the memory or disk space currently available, whether the RDD's data is kept in memory or on disk. If the data is too large to be cached entirely in memory, Spark evicts the least recently used blocks according to the storage level and an LRU policy; data evicted from memory has to be recomputed the next time it is used.

RDD partitioner

In Spark, an RDD partition is one of the chunks into which an RDD's data set is divided; each partition can be processed in parallel by one task. Partitioning is key to Spark's efficient computation, because it allows data to be processed in parallel and therefore improves performance.

An RDD partitioner (Partitioner) goes one step further and controls how the RDD's data is redistributed across partitions, that is, how the data is assigned to different nodes for computation, so that the cluster's resources and Spark's parallel computing capability can be used more effectively.

Spark provides two built-in partitioners: the hash partitioner and the range partitioner. The HashPartitioner assigns records to partitions based on the hash of the key, while the RangePartitioner assigns records to partitions based on ranges of the key. Hash partitioning is the default; for operations such as sortByKey, Spark automatically chooses a RangePartitioner based on the range of the data and the number of partitions.

Aggregation operations such as groupByKey and reduceByKey use a hash partitioner by default, while sortByKey uses a range partitioner. A partitioner can also be specified explicitly by calling partitionBy on a pair RDD, for example:

import org.apache.spark.HashPartitioner
val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val partitioner = new HashPartitioner(2)
val partitionedRDD = rdd.partitionBy(partitioner)

In the example above, we create an RDD with three elements and repartition it with a HashPartitioner into two partitions (the HashPartitioner constructor takes the number of partitions, here 2). Calling partitionBy redistributes the RDD's data according to this partitioner, so that it can be spread across the cluster and processed in parallel.

RDD file reading and saving

In Spark, RDD read and save operations can be used to load data into an RDD or to save the data in an RDD to an external storage system. Commonly used ways of reading and saving RDDs include:

Reading data from a file
You can use the sc.textFile() method to read data from a file, for example:

val rdd = sc.textFile("path/to/file")

Here, path/to/file is the path of the file to read; the method returns an RDD containing all lines of the file.

Saving an RDD to a file
You can use the RDD.saveAsTextFile() method to save the data in an RDD to a file, for example:

val rdd = sc.parallelize(Seq("Hello", "World", "Spark"))
rdd.saveAsTextFile("path/to/output")

Here, path/to/output is the output path; the method saves the RDD's data in text format.

Reading data from the Hadoop file system
You can use the sc.hadoopFile() method to read data from the Hadoop file system, for example:

val rdd = sc.hadoopFile("path/to/hdfs/file", classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

Here, path/to/hdfs/file is the Hadoop file path to read, TextInputFormat is the input format (for sc.hadoopFile this is the old-API org.apache.hadoop.mapred.TextInputFormat), and LongWritable and Text are the key and value types respectively; the method returns an RDD of (key, value) pairs covering all lines in the Hadoop file.

Save RDD to the Hadoop file system
You can use the RDD.saveAsHadoopFile() method to save the data in the RDD to the Hadoop file system, for example:

val rdd = sc.parallelize(Seq("Hello", "World", "Spark"))
rdd.saveAsHadoopFile("path/to/hdfs/output", classOf[Text], classOf[Text], classOf[TextOutputFormat])

Here, path/to/hdfs/output is the Hadoop output path, Text is the declared key and value type, and TextOutputFormat is the output format. Note that saveAsHadoopFile is available on pair (key-value) RDDs; it saves the RDD's data in the chosen Hadoop file format.
