Spark Introduction and Building a Pseudo-Distributed Cluster

I. Overview

Apache Spark is a lightning-fast unified analytics engine (it does not provide a data storage solution of its own).

Lightning fast (compared with MapReduce, the conventional big data processing framework):

  • Spark splits a complex computing job into multiple fine-grained Stages, and each Stage can be computed in parallel across the cluster; the first-generation MapReduce engine only splits a job into the coarse-grained MapTask and ReduceTask, so a complex computation often has to be expressed as several MapReduce jobs chained in series.
  • Spark is an in-memory computing engine, whereas MapReduce is a disk-based computing engine.
  • Spark supports caching intermediate results, which greatly improves computational efficiency; cached results can be reused by later computations and used for recovery, whereas MapReduce has to spill its intermediate results to disk (see the sketch below).
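
Below is a minimal sketch of the caching point above. It is an illustration only: the application name, input path, and local master are placeholder assumptions, not anything defined elsewhere in this article.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheDemo").setMaster("local[*]"))

    // An "expensive" intermediate result; MapReduce would have to spill this to disk
    val cleaned = sc.textFile("hdfs:///data/input.txt")   // placeholder path
      .filter(_.nonEmpty)
      .map(_.toLowerCase)
      .persist(StorageLevel.MEMORY_AND_DISK)              // keep it around for reuse and recovery

    // Both actions reuse the cached RDD instead of recomputing it from the source
    println(cleaned.count())
    println(cleaned.filter(_.contains("spark")).count())

    sc.stop()
  }
}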

Unified (it provides all of the major big data processing paradigms; a combined example follows this list):

  • Batch processing: Spark RDD, replacing Hadoop MapReduce
  • Stream processing: Spark Streaming and Spark Structured Streaming, replacing Kafka Streams and Storm
  • Interactive queries (SQL): Spark SQL, replacing Hive
  • Machine learning: Spark MLlib, replacing Mahout
  • Graph computation: Spark GraphX, computing over graph data that graph-oriented NoSQL databases (such as Neo4j) store
  • Spark ecosystem libraries: solving other big data processing problems
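
To make the "unified" claim concrete, here is a hedged sketch that mixes batch-style RDD work with a Spark SQL query inside one application; the sample data, application name, and local master are made up for illustration.

import org.apache.spark.sql.SparkSession

object UnifiedDemo {
  def main(args: Array[String]): Unit = {
    // One SparkSession drives batch, SQL, streaming and MLlib alike
    val spark = SparkSession.builder()
      .appName("UnifiedDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Batch-style RDD work and SQL-style queries run on the same engine
    val words = spark.sparkContext.parallelize(Seq("spark", "hadoop", "spark"))
    words.toDF("word").createOrReplaceTempView("words")
    spark.sql("SELECT word, COUNT(*) AS cnt FROM words GROUP BY word").show()

    spark.stop()
  }
}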

Features

  • Efficient: high performance for both batch and stream computation, using a sophisticated DAG scheduler (directed acyclic graph scheduler), a query optimizer, and a physical execution engine
  • Easy to use: provides more than 80 high-level operators that simplify building distributed parallel applications, with support for multiple programming languages (Java, Python, Scala, R); a small example follows this list
  • General: integrates a variety of data processing paradigms, such as SQL, batch, streaming, machine learning, and graph processing
  • Runs everywhere: supports a variety of resource management and scheduling systems, such as YARN, Apache Mesos, Kubernetes (K8s), Standalone, and so on
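
As a small example of the "easy to use" point noted in the list above, the sketch below expresses a distributed computation with just a few high-level operators; the numbers and the local master are placeholders rather than part of any setup described here.

import org.apache.spark.{SparkConf, SparkContext}

object HighOrderDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HighOrderDemo").setMaster("local[*]"))

    // A distributed computation expressed with a handful of high-level operators
    val sumOfEvenSquares = sc.parallelize(1 to 1000)
      .filter(_ % 2 == 0)
      .map(n => n.toLong * n)
      .reduce(_ + _)
    println(sumOfEvenSquares)

    sc.stop()
  }
}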

II. MapReduce vs Spark

MapReduce, the first-generation big data processing framework, was designed early on simply to meet the urgent demand for computation over massive data sets. It was split out of the Nutch project (a Java search engine) in 2006, and it solved the primary problems people faced in the early days of big data.
The entire MapReduce implementation computes on the basis of disk I/O. As big data technology grew in popularity, people began to redefine what big data meant: no longer satisfied with merely completing a computation over massive data within a reasonable time frame, they placed stricter requirements on computational efficiency and started using the MapReduce framework for complex, higher-order algorithms. These algorithms usually cannot be completed in a single MapReduce pass, and because the MapReduce computing model always stores its results to disk, every iteration has to load the data from disk into memory again, which adds further delay to each subsequent iteration.
Spark was born in 2009 in the AMP Lab at UC Berkeley. After it was open-sourced in 2010 it was embraced by many developers; it entered the Apache Incubator in June 2013 and became a top-level Apache project in February 2014. Spark grew so quickly because, at the computation layer, it is clearly superior to Hadoop MapReduce's disk-based iterative computation: Spark can compute on data held in memory, and it can cache intermediate results in memory as well, which saves time in subsequent iterations and greatly improves computational efficiency over massive data sets.
The Spark project also presented a comparison in which Spark and MapReduce each ran a regression analysis (an algorithm requiring n iterations): Spark's computation speed was almost 10 to 100 times that of MapReduce.
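
An iterative workload of that shape looks roughly like the sketch below: load the data once, cache it, and iterate in memory. This is only a toy illustration under assumed inputs (the path, the "x,y" file format, the step size, and the iteration count are all invented), not the benchmark the Spark project ran.

import org.apache.spark.{SparkConf, SparkContext}

object IterativeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeDemo").setMaster("local[*]"))

    // Load the training points once and keep them in memory;
    // MapReduce would re-read them from disk on every iteration.
    val points = sc.textFile("hdfs:///data/points.txt")   // placeholder path, one "x,y" pair per line
      .map { line => val Array(x, y) = line.split(","); (x.toDouble, y.toDouble) }
      .cache()

    // Toy gradient-descent loop fitting a single weight w
    var w = 0.0
    for (_ <- 1 to 10) {
      val gradient = points.map { case (x, y) => (w * x - y) * x }.mean()
      w -= 0.1 * gradient
    }
    println(s"fitted w = $w")

    sc.stop()
  }
}
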
Not only that: in its design philosophy Spark also proposed a "one stack to rule them all" strategy, providing branch libraries on top of Spark's batch computation services, such as Spark SQL for interactive queries, Spark Streaming for near-real-time stream processing, MLlib for machine learning, and GraphX for graph relational computation.
It is not hard to see that at the computation layer, Apache Spark positioned itself from the start as a connector in the ecosystem rather than discarding Hadoop, the original mainstream big data solution. Spark can read the data it computes on from HDFS, HBase, Cassandra, Amazon S3, and ordinary file servers, which means that when Spark is adopted as the computation layer, the user's existing storage-layer architecture does not need to change.

III. Computation Process

Because Spark was born after MapReduce, it was able to draw on MapReduce's design experience and largely avoid the parts of the MapReduce computation process that drew criticism. Let's first review the MapReduce computation process.
To summarize, it has a few drawbacks:

1) Although MapReduce is based on a vector programming model, its computation states are too simple: a job is split only into a Map state and a Reduce state, with no consideration for iterative computation scenarios.
2) The intermediate results of Map tasks are stored on local disk, which causes excessive I/O calls and poor data-write efficiency.
3) MapReduce submits the task first and only then applies for resources during the computation, and its computation model is heavyweight: each unit of parallelism is realized by a separate JVM process.

From this short list it is easy to spot the criticized problems of MapReduce computation. Spark therefore drew on MapReduce's experience at the computation layer and proposed the DAGScheduler and TaskScheduler, breaking the limitation that a MapReduce job has only the two phases, Map State and Reduce State, which is unsuitable for scenarios requiring many iterations. Spark's more advanced design splits the task into states: early in the job, the DAGScheduler first divides the computation into stages, encapsulates the tasks of each stage into a TaskSet, and the TaskScheduler then submits each TaskSet to the cluster for computation.
Compared with MapReduce, Spark's computation model has the following advantages:

1) Intelligent DAG task splitting: a complex computation is split into a number of stages, which accommodates iterative computation scenarios (see the sketch after this list).

2) Spark provides fault-tolerant computation and caching strategies: results can be stored in memory or on disk, which speeds up the execution of each stage and improves overall efficiency.

3) Spark applies for its computing resources at the start of the computation. Task parallelism is achieved by starting threads inside Executor processes, which is much more lightweight than MapReduce's process-per-task model.
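
The sketch below illustrates the stage splitting from point 1: reduceByKey introduces a shuffle dependency, so the DAGScheduler cuts the job into two stages, which toDebugString makes visible in the printed lineage. The input path, application name, and local master are placeholder assumptions.

import org.apache.spark.{SparkConf, SparkContext}

object StageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StageDemo").setMaster("local[*]"))

    val counts = sc.textFile("hdfs:///data/words.txt")   // placeholder path
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)   // shuffle dependency: the DAGScheduler cuts a stage boundary here

    // Print the lineage; the indentation marks the two stages the job is split into
    println(counts.toDebugString)
    counts.collect()

    sc.stop()
  }
}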

Spark currently provides Cluster Manager implementations for YARN, Standalone, Mesos, Kubernetes, and so on. The most commonly used are YARN and Standalone mode.

IV. Building a Spark Pseudo-Distributed Cluster (Standalone)

  • All cluster services are simulated on a single server.
  • A Standalone cluster uses Spark's built-in resource management and scheduling system, similar to Hadoop YARN.

Prerequisites

  • A CentOS 7.2 virtual machine
  • JDK 8 or above
  • hadoop-2.9.2.tar.gz
  • spark-2.4.4-bin-without-hadoop.tgz

Initial virtual machine configuration

Network
# 1. Dual NIC setup: add a second network adapter and set the network connection to NAT mode
# 2. Edit the NIC configuration files: ens33 (internal communication, static address), ens37 (external communication, dynamic address)
[root@SparkOnStandalone ~]# vi /etc/sysconfig/network-scripts/ifcfg-ens33
BOOTPROTO=static
IPADDR=192.168.197.201
NETMASK=255.255.255.0

Turn off the firewall

[root@SparkOnStandalone ~]# systemctl disable firewalld

Set the host name

[root@SparkOnStandalone ~]# vi /etc/hostname
SparkOnStandalone

Configure the hostname-to-IP mapping

[root@SparkOnStandalone ~]# vi /etc/hosts
192.168.197.201 SparkOnStandalone

Configure passwordless SSH login

[root@SparkOnStandalone ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:60iw7cJLQf2X1GZIDBq8GKTVy5OmL21xUw5XYecYNSQ root@SparkOnStandalone
The key's randomart image is:
+---[RSA 2048]----+
|   .oo. .o. E+=  |
|   o..oo ..+.* . |
|  . .+o+  o.= .  |
|   .. B...o+     |
|    oo .S=o      |
|    .=. oo.      |
|   .oooo..       |
|   .+o+o         |
|    .=o .        |
+----[SHA256]-----+
[root@SparkOnStandalone ~]# ssh-copy-id SparkOnStandalone
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
The authenticity of host 'sparkonstandalone (192.168.197.201)' can't be established.
ECDSA key fingerprint is SHA256:gsJGwQ2NjqNlR1WjbLzlxJQnv24GdfTiqDm/hu4d7+s.
ECDSA key fingerprint is MD5:11:f9:7c:cd:f3:e1:b9:89:60:71:51:dd:b5:1e:49:0e.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys

# Verify passwordless login
[root@SparkOnStandalone ~]# ssh SparkOnStandalone
Last login: Mon Jan 13 11:23:33 2020 from 192.168.197.1
[root@SparkOnStandalone ~]# exit
logout
Connection to sparkonstandalone closed.

Install the JDK

[root@SparkOnStandalone ~]# rpm -ivh jdk-8u171-linux-x64.rpm

Install Hadoop HDFS

[root@SparkOnStandalone ~]# tar -zxf hadoop-2.9.2.tar.gz -C /usr

[root@SparkOnStandalone ~]# vi /usr/hadoop-2.9.2/etc/hadoop/core-site.xml
<!-- NameNode access endpoint -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://SparkOnStandalone:9000</value>
</property>
<!-- HDFS working base directory -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop-2.9.2/hadoop-${user.name}</value>
</property>

[root@SparkOnStandalone ~]# vi /usr/hadoop-2.9.2/etc/hadoop/hdfs-site.xml
<!-- block replication factor -->
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<!-- host that runs the Secondary NameNode -->
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>SparkOnStandalone:50090</value>
</property>
<!-- maximum number of concurrent file operations on a DataNode -->
<property>
        <name>dfs.datanode.max.xcievers</name>
        <value>4096</value>
</property>
<!-- DataNode parallel handling capacity -->
<property>
        <name>dfs.datanode.handler.count</name>
        <value>6</value>
</property>

[root@SparkOnStandalone ~]# vi /usr/hadoop-2.9.2/etc/hadoop/slaves
SparkOnStandalone

Configure environment variables

[root@SparkOnStandalone ~]# vi .bashrc
HADOOP_HOME=/usr/hadoop-2.9.2
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
CLASSPATH=.
export HADOOP_HOME
export JAVA_HOME
export PATH
export CLASSPATH
[root@SparkOnStandalone ~]# source .bashrc

Initialize the NameNode

[root@SparkOnStandalone ~]# hdfs namenode -format

Start HDFS

[root@SparkOnStandalone ~]# start-dfs.sh
[root@SparkOnStandalone ~]# jps
11266 NameNode
11847 Jps
11416 DataNode
11611 SecondaryNameNode

Install Spark

[root@SparkOnStandalone ~]# tar -zxf spark-2.4.4-bin-without-hadoop.tgz -C /usr
[root@SparkOnStandalone ~]# cd /usr/
[root@SparkOnStandalone usr]# mv spark-2.4.4-bin-without-hadoop/ spark-2.4.4

Configure Spark

[root@SparkOnStandalone usr]# cd spark-2.4.4/conf/
[root@SparkOnStandalone conf]# cp slaves.template slaves
[root@SparkOnStandalone conf]# cp spark-env.sh.template spark-env.sh
[root@SparkOnStandalone conf]# cp spark-defaults.conf.template spark-defaults.conf

[root@SparkOnStandalone conf]# vi spark-env.sh
SPARK_WORKER_INSTANCES=1
SPARK_MASTER_HOST=SparkOnStandalone
SPARK_MASTER_PORT=7077
SPARK_WORKER_CORES=4
SPARK_WORKER_MEMORY=2g
# Use the native libraries and jars of the local Hadoop installation
# (required because this is the "without-hadoop" Spark build)
LD_LIBRARY_PATH=/usr/hadoop-2.9.2/lib/native
SPARK_DIST_CLASSPATH=$(hadoop classpath)
export SPARK_MASTER_HOST
export SPARK_MASTER_PORT
export SPARK_WORKER_CORES
export SPARK_WORKER_MEMORY
export LD_LIBRARY_PATH
export SPARK_DIST_CLASSPATH
export SPARK_WORKER_INSTANCES

[root@SparkOnStandalone conf]# vi slaves
SparkOnStandalone

Start the Spark cluster in Standalone mode (pseudo-distributed)

[root@SparkOnStandalone spark-2.4.4]# ./sbin/start-all.sh

Verify the services

# Method 1
[root@SparkOnStandalone sbin]# jps
17216 Master
11266 NameNode
17347 Worker
11416 DataNode
11611 SecondaryNameNode
17612 Jps

# Method 2: open the Spark Master web UI in a browser
http://192.168.197.201:8080/
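
As an additional smoke test (not part of the original verification steps), you can connect a spark-shell to the standalone master and run a tiny job; the snippet below is a minimal sketch assuming the master host and port configured in spark-env.sh above.

// Launch the REPL against the standalone master:
//   /usr/spark-2.4.4/bin/spark-shell --master spark://SparkOnStandalone:7077
// Inside spark-shell, `sc` is already bound to a SparkContext running on the cluster.
val rdd = sc.parallelize(1 to 100, 4)   // 4 partitions spread across the Worker's cores
println(rdd.sum())                      // should print 5050.0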