[Ubuntu-Big Data] Spark installation and configuration


Reference article: http://dblab.xmu.edu.cn/blog/931-2/

(Hadoop 3 needs to be installed first.)

1. Download Spark (version 3.x) from the official website (https://spark.apache.org/downloads.html).


2. Local (single-machine) mode installation and configuration:

There are four main Spark deployment modes:

  • Local mode (single-machine mode)
  • Standalone mode (using the simple cluster manager that comes with Spark)
  • YARN mode (using YARN as the cluster manager)
  • Mesos mode (using Mesos as the cluster manager)

(1) Unzip the compressed package into the directory where Hadoop was previously installed, /usr/local:

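A command along these lines does the extraction (the archive name is a placeholder; substitute the exact Spark 3 package you downloaded and adjust the download path if necessary):

sudo tar -zxf ~/Downloads/spark-3.x.x-bin-without-hadoop.tgz -C /usr/local/
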
(2) cd into the directory:
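That is:

cd /usr/local
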
(3) The name of the decompressed directory is too long, so rename it to spark:
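Assuming the unpacked directory still carries the full package name (a placeholder below), the rename is:

sudo mv ./spark-3.x.x-bin-without-hadoop/ ./spark
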
(4) Grant the relevant permissions on the directory (lpp2 is your Hadoop user name, which can be viewed in the [Users] settings):
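A typical ownership change looks like this (replace lpp2 with your own user name if it differs):

sudo chown -R lpp2:lpp2 ./spark
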
(5) Enter the spark directory:
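In other words:

cd /usr/local/spark
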
(6) Use cp to copy the configuration file template and rename the copy spark-env.sh:
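Spark ships a template for this file, so the copy is:

cp ./conf/spark-env.sh.template ./conf/spark-env.sh
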
(7) Edit the spark-env.sh file (vim ./conf/spark-env.sh) and add the following configuration information on the first line:

export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)

Press i to enter insert mode. When you have finished editing, press the ESC key, then type an English colon (Shift + ;), which appears at the bottom of the screen. Type wq and press Enter to save and exit the vim editor.
If you are worried, you can open the file manager to check:
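A quicker check from the terminal is to print the file and expand the classpath command it calls (assuming Hadoop lives under /usr/local/hadoop as above):

cat ./conf/spark-env.sh
/usr/local/hadoop/bin/hadoop classpath
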
If you look carefully, you will see that this simply adds the Hadoop classpath. With this configuration information in place, Spark can store data to the Hadoop Distributed File System (HDFS) and also read data from HDFS. (Without it, Spark can only read and write local data and cannot read or write HDFS data.)

(8) Run one of the bundled examples directly from the bin directory to check whether the installation was successful:
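From the spark directory, the bundled SparkPi example can be launched like this (it prints a large amount of log output along with the result):

bin/run-example SparkPi
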
(9) Use a pipe to filter the output:

 bin/run-example SparkPi 2>&1 | grep "Pi is"

If the output contains a line like "Pi is roughly 3.14...", the installation was successful.

3. Run Spark applications on the cluster

  • 1. Standalone mode
    Similar to the MapReduce 1.0 framework, the Spark framework itself comes with complete resource scheduling and management services and can be deployed in a cluster on its own, without relying on other systems for resource management and scheduling. In architectural design, Spark is consistent with MapReduce 1.0: both consist of one Master and several Slaves, and both use slots as the unit of resource allocation. The difference is that Spark's slots are no longer divided into Map slots and Reduce slots as in MapReduce 1.0; instead, a single unified kind of slot is used by all kinds of tasks.
  • 2. Spark on Mesos mode
    Mesos is a resource scheduling and management framework that can provide these services to Spark running on top of it. In Spark on Mesos mode, Mesos is responsible for scheduling the various resources required by the Spark program. Because Mesos and Spark share a common lineage, full support for Mesos was taken into account when the Spark framework was designed and developed, so Spark runs more flexibly and naturally on Mesos than on YARN. Currently, Spark officially recommends this mode, and many companies also use it in practical applications.
  • 3. Spark on YARN mode
    Spark can run on YARN and be deployed together with Hadoop, i.e. "Spark on YARN"; its architecture is shown in Figure 9-13. Resource management and scheduling rely on YARN, and distributed storage relies on HDFS. (A sketch of how the --master flag selects these modes follows below.)
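In all of these modes, the cluster manager is normally selected with the --master flag of spark-submit. The sketch below is illustrative only: the SparkPi class and the example jar ship with a standard Spark distribution, while the host names and ports are placeholders for your own cluster:

# Local mode: run on this machine with 4 worker threads
bin/spark-submit --master local[4] \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_*.jar

# Other typical --master values (host names/ports are placeholders):
#   --master spark://Master:7077         # Spark's own standalone cluster manager
#   --master yarn                        # YARN (configuration taken from HADOOP_CONF_DIR)
#   --master mesos://MesosMaster:5050    # Mesos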

Cluster environment construction: http://dblab.xmu.edu.cn/blog/1187-2/

Here we use 3 machines (nodes) as an example to demonstrate how to build a Spark cluster.

  • One machine (node) serves as the Master node,
  • The other two machines (nodes) serve as Slave nodes (that is, Worker nodes), with host names Slave01 and Slave02 respectively (see the worker-list sketch below).
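On the Master node, those Worker host names go into Spark's worker list; a minimal sketch, assuming the host names above (in Spark 3.x the file is conf/workers, copied from conf/workers.template; older releases call it conf/slaves):

# /usr/local/spark/conf/workers -- one Worker host name per line
Slave01
Slave02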

To be updated…
