[Spark] spark-2.4.4 installation and testing

4.2.1 Download and install Spark

Downloaded file: spark-2.4.4-bin-without-hadoop.tgz (the "without Hadoop" build from the Apache Spark download page)

[hadoop@hadoop01 ~]$ tar -zxvf spark-2.4.4-bin-without-hadoop.tgz


4.2.2 Configure Linux environment variables

[hadoop@hadoop01 ~]$ gedit ~/.bash_profile
Add the following lines:
#spark
export SPARK_HOME=/home/hadoop/spark-2.4.4-bin-without-hadoop
export PATH=$PATH:$SPARK_HOME/bin
Then reload the profile:
[hadoop@hadoop01 ~]$ source ~/.bash_profile
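To confirm that the new variables are picked up, check which spark-submit the shell now resolves (the expected path simply follows from the SPARK_HOME set above):

[hadoop@hadoop01 ~]$ which spark-submit
/home/hadoop/spark-2.4.4-bin-without-hadoop/bin/spark-submit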


4.2.3 spark-env.sh environment variables

[hadoop@hadoop01 conf]$ cp spark-env.sh.template spark-env.sh
[hadoop@hadoop01 conf]$ gedit spark-env.sh
Add:
export JAVA_HOME=/usr/java/jdk1.8.0_131
export SCALA_HOME=/home/hadoop/scala-2.13.1
export SPARK_MASTER_IP=192.168.1.100
export HADOOP_HOME=/home/hadoop/hadoop-3.2.0
export HADOOP_CONF_DIR=/home/hadoop/hadoop-3.2.0/etc/hadoop
export SPARK_DIST_CLASSPATH=$(/home/hadoop/hadoop-3.2.0/bin/hadoop classpath)   # without this line, startup fails (see 4.2.6.1)
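The last line runs hadoop classpath at Spark startup and appends the result to Spark's classpath (the "without-hadoop" build ships no Hadoop jars of its own). To see what it expands to, the same command can be run by hand:

[hadoop@hadoop01 ~]$ /home/hadoop/hadoop-3.2.0/bin/hadoop classpath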

 

4.2.4 Modify the slaves file

[hadoop@hadoop01 conf]$ cp slaves.template slaves
[hadoop@hadoop01 conf]$ gedit slaves
Add:
hadoop02
hadoop03
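
These worker hostnames must resolve on every node. Going by the addresses used elsewhere in this guide (192.168.1.100 for the master and the worker web UIs on .101/.102), /etc/hosts would contain entries along these lines (a sketch; adjust to the actual cluster):

192.168.1.100 hadoop01
192.168.1.101 hadoop02
192.168.1.102 hadoop03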


4.2.5 Copy to the other nodes

[hadoop@hadoop01 ~]$ scp -r ~/spark-2.4.4-bin-without-hadoop hadoop02:~/
[hadoop@hadoop01 ~]$ scp -r ~/spark-2.4.4-bin-without-hadoop hadoop03:~/

4.2.6 Start Spark

4.2.6.1 Direct start

[hadoop@hadoop01 spark-2.4.4-bin-without-hadoop]$ sbin/start-all.sh
The first start failed with an error:
The console showed: failed to launch: nice -n 0 /home/hadoop/spark-2.4.4-bin-without-hadoop/bin/spark-class org.apache.spark.deploy.master.Master --host hadoop01 --port 7077 --webui-port 808
Checking the logs directory under the Spark installation:
Spark Command: /usr/java/jdk1.8.0_131/bin/java -cp /home/hadoop/spark-2.4.4-bin-without-hadoop/conf/:/home/hadoop/spark-2.4.4-bin-without-hadoop/jars/*:/home/hadoop/hadoop-3.2.0/etc/hadoop/ -Xmx1g org.apache.spark.deploy.master.Master --host hadoop01 --port 7077 --webui-port 8080
========================================
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
    at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
    at java.lang.Class.getMethod0(Class.java:3018)
    at java.lang.Class.getMethod(Class.java:1784)
    at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 7 more

Solution: add export SPARK_DIST_CLASSPATH=$(/home/hadoop/hadoop-3.2.0/bin/hadoop classpath) at the end of spark-env.sh (see 4.2.3), then start Spark again:

[hadoop@hadoop01 spark-2.4.4-bin-without-hadoop]$ sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /home/hadoop/spark-2.4.4-bin-without-hadoop/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-hadoop01.out
hadoop03: starting org.apache.spark.deploy.worker.Worker, logging to /home/hadoop/spark-2.4.4-bin-without-hadoop/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop03.out
hadoop02: starting org.apache.spark.deploy.worker.Worker, logging to /home/hadoop/spark-2.4.4-bin-without-hadoop/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop02.out
[hadoop@hadoop01 spark-2.4.4-bin-without-hadoop]$ jps
29200 NodeManager
28595 DataNode
29059 ResourceManager
28804 SecondaryNameNode
30564 Master
28424 NameNode
30777 Jps
[hadoop@hadoop01 spark-2.4.4-bin-without-hadoop]$ ssh hadoop02
Last login: Wed Oct  2 11:21:48 2019 from hadoop01
[hadoop@hadoop02 ~]$ jps
19905 Worker
20098 Jps
18922 DataNode
19054 NodeManager
[hadoop@hadoop02 ~]$ ssh hadoop03
Last login: Tue Oct  1 15:58:07 2019 from hadoop02
[hadoop@hadoop03 ~]$ jps
18896 Jps
17699 DataNode
17829 NodeManager
18694 Worker


Note 1: jps on all three nodes now shows the expected processes: Master (Spark's master process) on hadoop01, and a Worker on each of hadoop02 and hadoop03.
Note 2: Hadoop and Spark each ship their own start-all.sh, so start them with explicit paths to avoid ambiguity, e.g. hadoop-3.2.0/sbin/start-all.sh or spark-2.4.4-bin-without-hadoop/sbin/start-all.sh.

4.2.6.2 Spark provides a web interface for viewing the status of the cluster; each node can be checked at:
http://192.168.1.100:8080/ (master web UI)
http://192.168.1.101:8081/ (worker web UI)
http://192.168.1.102:8081/ (worker web UI)


4.2.7 Stop Spark

[hadoop@hadoop01 spark-2.4.4-bin-without-hadoop]$ sbin/stop-all.sh
hadoop02: stopping org.apache.spark.deploy.worker.Worker
hadoop03: stopping org.apache.spark.deploy.worker.Worker
stopping org.apache.spark.deploy.master.Master
[hadoop@hadoop01 spark-2.4.4-bin-without-hadoop]$ jps
29200 NodeManager
28595 DataNode
29059 ResourceManager
28804 SecondaryNameNode
28424 NameNode
31112 Jps


4.2.8 Simple use of Spark

4.2.8.1 Run a sample program
Spark ships with example programs under ./examples/src/main, in Scala, Java, Python, R and other languages. As a first test, run the SparkPi example (it computes an approximation of π) with the command below.
Note: the run prints a large amount of log output, which makes the result hard to spot, so it is filtered through grep (2>&1 redirects stderr to stdout so all output can be piped; otherwise the log lines would still go straight to the screen):

[hadoop@hadoop01 spark-2.4.4-bin-without-hadoop]$ ./bin/run-example SparkPi 2>&1 | grep "Pi is roughly"
Pi is roughly 3.135075675378377

The Python version of SparkPi has to be run through spark-submit:

[hadoop@hadoop01 spark-2.4.4-bin-without-hadoop]$ ./bin/spark-submit examples/src/main/python/pi.py 2>&1 | grep "Pi is roughly"
Pi is roughly 3.144580
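
Both runs above default to the local master. With the standalone cluster from 4.2.6 running, the same example can also be submitted to the master explicitly; a sketch, assuming the stock Spark 2.4.4 examples jar built for Scala 2.11:

[hadoop@hadoop01 spark-2.4.4-bin-without-hadoop]$ ./bin/spark-submit --master spark://hadoop01:7077 --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.4.4.jar 100 2>&1 | grep "Pi is roughly"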


4.2.8.2 Enter shell mode:

[hadoop@hadoop01 jars]$ spark-shell
2019-10-02 21:05:49,292 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://hadoop01:4040
Spark context available as 'sc' (master = local[*], app id = local-1570021558420).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 
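
A quick sanity check can be typed at the prompt, using the SparkContext that is already available as sc (a minimal sketch):

scala> sc.parallelize(1 to 100).reduce(_ + _)   // distribute the numbers 1..100 and sum them
res0: Int = 5050

scala> :quit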


Attachment:

[Repost] The three deployment modes of a Spark cluster

Spark's resource management is handled mainly by Hadoop YARN, Spark's own Standalone manager, or Apache Mesos. For use on a single machine, Spark also offers a basic local mode.

Apache Spark currently supports three distributed deployment modes: standalone, Spark on Mesos and Spark on YARN. The first is similar to the model used by MapReduce 1.0, implementing fault tolerance and resource management internally; the latter two represent the future trend, handing part of the fault tolerance and resource management over to a unified resource-management system. Running Spark on a general-purpose resource manager lets it share a cluster with other computing frameworks such as MapReduce, and the biggest benefit is lower operating cost and better resource utilization (through elastic resource allocation). This section introduces the three deployment modes and compares their advantages and disadvantages.


1. Standalone mode

Standalone mode ships with a complete set of services and can be deployed on a cluster by itself, without relying on any external resource-management system. To some extent it is the basis for the other two modes. It also reflects a common pattern for developing a new computing framework: first design its standalone mode so development can proceed quickly, initially without built-in fault tolerance for the services themselves (such as master/slave); later, wrap the services so that the standalone deployment can run unchanged on a resource-management system such as YARN or Mesos, which then takes over fault tolerance for those services. Spark's standalone mode currently has no master single point of failure, which is achieved with ZooKeeper, following an idea similar to the HBase master failover solution. Comparing Spark standalone with MapReduce, the two architectures are essentially the same:
1) Both are master/slave services, and both originally had a master single point of failure, later resolved with ZooKeeper (the JobTracker in Apache MRv1 still has the single-point problem, but it has been resolved in the CDH version);
2) On each node, resources are abstracted into coarse-grained slots: as many slots as there are, that many tasks can run simultaneously. The difference is that MapReduce splits slots into map slots and reduce slots, usable only by Map Tasks and Reduce Tasks respectively and never shared, which is one reason for MapReduce's poor resource utilization. Spark is more optimized here: it does not distinguish slot types, so a single kind of slot can be used by any type of task. This improves resource utilization but is less flexible, since resources cannot be customized per task type. In short, both approaches have pros and cons.
 

2. Spark On Mesos mode

This mode is used by many companies and is officially recommended (partly, of course, because of the kinship between the projects). Because Mesos support was considered from the very beginning of Spark's development, Spark currently runs more flexibly and more naturally on Mesos than on YARN. In a Spark on Mesos environment, users can choose between two scheduling modes for running their applications (see Andrew Xia's "Mesos Scheduling Mode on Spark"):
1) Coarse-grained Mode: each application's execution environment consists of one Driver and several Executors, where each Executor occupies several resources and can run multiple tasks internally (corresponding to the number of "slots"). Before any task of the application runs, all resources of the execution environment must be allocated, and they stay occupied for the whole run even when idle; they are only released after the application finishes. For example, when submitting an application you might request 5 executors, each with 5 GB of memory, 5 CPUs and 5 slots; Mesos allocates resources for the executors and launches them, and tasks are then scheduled onto them. Moreover, while the program runs, the Mesos master and slaves are unaware of the individual tasks inside each executor; executors report task state directly to the Driver through an internal communication mechanism. To some extent, each application builds its own virtual cluster on top of Mesos.

2) Fine-grained Mode: because the coarse-grained mode can waste a lot of resources, Spark on Mesos also provides a fine-grained scheduling mode, which, like today's cloud computing, follows the idea of on-demand allocation. As in coarse-grained mode, executors are started when the application starts, but each executor initially occupies only the resources it needs for itself, without reserving resources for future tasks. Mesos then dynamically allocates resources to each executor; a new task can run whenever resources are allocated, and the resources of a single task are released immediately when it finishes. Each task reports its state to the Mesos slave and the Mesos master, which makes management and fault tolerance more fine-grained. This scheduling mode resembles MapReduce's: every task is completely independent, which makes resource control and isolation easy, but the obvious drawback is higher latency for short jobs.


3. Spark On YARN mode

This is a promising deployment mode. However, limited by YARN's own development, it currently supports only the coarse-grained mode (Coarse-grained Mode). This is because Containers on YARN are not resizable: once a Container has started, its resources can no longer change, although this is on the YARN roadmap.
Spark on YARN supports two modes:
1) yarn-cluster: suitable for production environments;
2) yarn-client: suitable for interactive use and debugging, when you want to see the application's output right away.

The difference between yarn-cluster and yarn-client lies in the YARN ApplicationMaster. Every YARN application instance has an ApplicationMaster process, which runs in the first container started for the application; it is responsible for requesting resources from the ResourceManager and, once resources are obtained, telling the NodeManagers to launch containers for it. The internal implementations of yarn-cluster and yarn-client differ considerably. For a production environment, choose yarn-cluster; if you just want to debug a program, yarn-client is fine.
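
As an illustration of the two modes (in Spark 2.x they are selected via --deploy-mode rather than the old yarn-cluster/yarn-client master names), reusing the SparkPi examples jar from the installation above and assuming HADOOP_CONF_DIR points at the Hadoop configuration as set in spark-env.sh:

./bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.4.4.jar 100
./bin/spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.4.4.jar 100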


Summary:
Each of the three distributed deployment modes has its own advantages and disadvantages, and the choice usually depends on the actual situation. When deciding, companies often also weigh their technology roadmap (Hadoop ecosystem or another stack) and the available talent pool. It is hard to say in general which of the modes above is best; it depends on your needs. If you just want to test a Spark application, local mode is enough. If the data volume is not large, Standalone is a good choice. When you need unified management of cluster resources (Hadoop, Spark, and so on), choose YARN or Mesos, at the cost of higher maintenance effort.
· From a pure comparison standpoint, Spark on Mesos looks like the better choice, and it is the officially recommended one.
· But if you run Hadoop and Spark side by side, YARN is the better choice for compatibility. If, besides Hadoop and Spark, you also run Docker on the resource manager, Mesos is the more general option.
· Standalone is the best fit for small computing clusters!

Origin www.cnblogs.com/CQ-LQJ/p/11618541.html