Notes on understanding Spark and Hadoop

This article records several run modes and deployment methods of the Spark platform, together with an introduction to the Hadoop platform, which is closely related to Spark. On this basis, the similarities and differences between the Hadoop platform and Spark are explored.

Spark platform test

1 Ways to run Spark code

There are two ways to run Spark code: interactively, through the shell started from the command console, and by submitting a packaged application; these are handled by spark-shell and spark-submit respectively. spark-submit is a shell script in the bin directory of the Spark installation, used to launch applications on a cluster (for example *.py scripts). For all of the cluster modes supported by Spark, spark-submit provides a unified interface for submitting applications and does not require many mode-specific settings. When spark-submit is used, the application jar and any jar files included through the --jars option are automatically shipped to the cluster.

2 Spark run modes

Spark run modes are mainly divided into the following types:

  1. Local

    In this mode Spark does not use a cluster; it is equivalent to starting a local process and then simulating the execution of Spark jobs inside that single process. A Spark job corresponds to one or more executor threads in the process, which execute it, including job scheduling and task allocation.

  2. Cluster

    1. Spark's built-in standalone mode
      • client mode
      • cluster mode
    2. Cluster mode using YARN for resource scheduling
      • client mode
      • cluster mode
client mode

[Figure: client-mode submission process]

Process

  1. After a task is submitted in client mode, the Driver process is started on the client.
  2. The Driver applies to the Master for the resources needed to start the Application.
  3. Once the resource application succeeds, the Driver sends tasks to the Workers for execution.
  4. The Workers return the task execution results to the Driver.
cluster mode

[Figure: cluster-mode submission process]

Implementation process

  1. After the client runs spark-submit --deploy-mode cluster, a spark-submit process is started.
  2. This process applies to the Master for resources for the Driver; by default the Driver process requires 1 GB of memory and 1 core.
  3. The Master starts the Driver process on a randomly chosen Worker node.
  4. After the Driver starts successfully, the spark-submit process exits, and the Driver then applies to the Master for resources.
  5. After receiving the request, the Master starts Executor processes on Worker nodes with sufficient resources.
  6. The Driver distributes tasks to the Executors for execution.

Whether Spark runs in its own standalone mode or in a cluster mode that uses YARN for resource scheduling, the principle of job execution is similar.

The difference between the client and cluster modes is whether the Driver runs on the client node that submits the job. In client mode the job's standard output is echoed back to the client console, which is convenient for debugging, but it causes a surge in traffic on the client machine, so it is not often used in real work.

In cluster mode, the client and the Spark cluster (the slave nodes) are not required to be on the same network segment, because the Driver that distributes the jar package and starts the job runs directly on one of the slave nodes. In this mode, however, the Driver consumes CPU and memory on a slave node, and the job's standard output is not echoed back to the client console. This is the mode usually used in real work.

3 The actual process of running a Spark application
3.1 Bundling application dependencies

If the code depends on other projects, these dependencies need to be packaged with the application so that they can be distributed to the Spark cluster. Both sbt and Maven have assembly plug-ins. When building the assembly jar, Spark and Hadoop can be listed as provided dependencies; they do not need to be bundled with the application, because the cluster manager provides them at runtime. Once an assembled jar exists, it is shipped to the cluster when the bin/spark-submit script is executed.

For Python, the --py-files option of spark-submit can be used to add .py, .zip, and .egg files, which are distributed together with the application. If the application depends on multiple Python files, it is recommended to package them into a single .zip or .egg file.
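As an illustration, the following is a minimal PySpark word-count sketch of such an application; the file names app.py and helpers.zip, the helper function clean_line, and the HDFS paths are assumptions rather than part of the original examples.

# app.py - a minimal PySpark word-count sketch (hypothetical file name and paths)
from pyspark import SparkConf, SparkContext
# clean_line would live in a helpers module shipped via --py-files (assumed)
from helpers import clean_line

def main():
    sc = SparkContext(conf=SparkConf().setAppName("WordCount"))
    lines = sc.textFile("hdfs://master:9000/data/input/README.txt")
    counts = (lines.map(clean_line)                        # normalize each line
                   .flatMap(lambda line: line.split(" "))  # split into words
                   .map(lambda word: (word, 1))            # emit (word, 1) pairs
                   .reduceByKey(lambda a, b: a + b))       # sum the counts per word
    counts.saveAsTextFile("hdfs://master:9000/data/output")
    sc.stop()

if __name__ == "__main__":
    main()

It could then be launched with something like bin/spark-submit --master spark://master:7077 --py-files helpers.zip app.py, and the zipped helpers module would be shipped to the executors along with the application.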

3.2 Launching the application

Once the application is packaged, the bin/spark-submit script can be used to launch it. This script sets up the classpath with Spark and the application's dependencies, and supports the different cluster managers and deploy modes that Spark provides. Whether the task is submitted in standalone mode or in Spark on YARN mode, the general format of a spark-submit invocation is as follows:

./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]

--class: the entry point of the application (e.g., org.apache.spark.examples.SparkPi)
--master: the master URL of the cluster (e.g., spark://localhost:7077)
--deploy-mode: whether to deploy the driver on a worker node (cluster mode) or locally as an external client (client mode); the default is client mode
--conf: arbitrary Spark configuration properties in key=value format, quoted if the value contains spaces
application-jar: path to the jar containing the application and all of its dependencies; the path must be globally visible inside the cluster, e.g., an hdfs:// path or a file:// path
application-arguments: arguments passed to the main function of the main class
3.3 Master URLs

The master URL passed to Spark can take any of the following formats, corresponding to the Spark cluster modes:

Master URL: meaning
local: run Spark locally with 1 worker thread (i.e., no parallelism at all)
local[K]: run Spark locally with K worker threads (ideally, set K to the number of CPU cores on the machine)
local[*]: run Spark locally with as many worker threads as there are logical CPU cores on the machine
spark://HOST:PORT: connect to the Master of the given Spark standalone cluster; the port must be the one the Master is configured with, 7077 by default
mesos://HOST:PORT: connect to the Master of the given Mesos cluster; the port must be the one the Master is configured with, 5050 by default. If the Mesos cluster uses ZooKeeper, the master URL takes the form mesos://zk://...
yarn-client: connect to a YARN cluster in client mode; the cluster location is taken from the HADOOP_CONF_DIR environment variable
yarn-cluster: connect to a YARN cluster in cluster mode; the cluster location is taken from the HADOOP_CONF_DIR environment variable
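The same values can also be set programmatically instead of on the command line. Below is a minimal PySpark sketch of doing so; the application name is an arbitrary assumption, and note that a master set directly on SparkConf takes precedence over the --master flag.

from pyspark import SparkConf, SparkContext

# Set the master URL in code rather than via --master (local[*] as an example)
conf = SparkConf().setAppName("master-url-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).count())  # quick sanity check: prints 100
sc.stop()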
4 Spark test record
Creating files on the HDFS platform
bin/hdfs dfs -mkdir -p /data/input  # create a directory
bin/hdfs dfs -put README.txt /data/input  # upload a file into the directory
bin/hdfs dfs -ls /data/input  # list the directory contents
bin/hdfs dfs -chmod -R 777 /data/input  # change permissions to avoid permission problems
spark-shell-local
val textFile = sc.textFile("hdfs://master:9000/data/input/README.txt")  // read the file from HDFS

val wordCount = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)  // define the map and reduce steps

wordCount.saveAsTextFile("hdfs://master:9000/data/output")  // save the word-count result back to HDFS
spark-submit-local
bin/spark-submit --class org.apache.spark.examples.SparkPi --master local examples/jars/spark-examples_2.11-2.2.0.jar  # submit the SparkPi example that ships with Spark
spark-standalone
bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  examples/jars/spark-examples_2.11-2.2.0.jar \
  100
# run with local[8]
bin/spark-submit   --class org.apache.spark.examples.SparkPi   --master spark://master:7077 --deploy-mode client --num-executors 2 examples/jars/spark-examples_2.11-2.2.0.jar
# standalone client mode
bin/spark-submit   --class org.apache.spark.examples.SparkPi   --master spark://master:6066   --deploy-mode cluster   --supervise   --driver-cores 4   --executor-memory 512M   --driver-memory 512M   --total-executor-cores 4   examples/jars/spark-examples_2.11-2.2.0.jar
# standalone cluster mode (6066 is the standalone REST submission port)
spark-yarn
bin/spark-submit   --class org.apache.spark.examples.SparkPi   --master yarn --deploy-mode client examples/jars/spark-examples_2.11-2.2.0.jar 

bin/spark-submit   --class org.apache.spark.examples.SparkPi   --master yarn --deploy-mode cluster examples/jars/spark-examples_2.11-2.2.0.jar 

Python reads hdfs

1 Configure pyhdfs

There are two main ways for Python to access HDFS:

  1. Call shell commands from Python to operate HDFS indirectly; this is relatively simple but does not scale well (a small sketch follows this list).
  2. Install a Python dependency library that wraps a set of function interfaces, which makes it possible to view, add, merge, and delete files on the HDFS platform.
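The first approach can be as simple as driving the hdfs command-line client from Python. A minimal sketch, assuming the hdfs command is on the PATH and the /data/input directory from the earlier examples exists:

import subprocess

# Run "hdfs dfs -ls /data/input" and capture its output (requires Python 3.7+)
result = subprocess.run(
    ["hdfs", "dfs", "-ls", "/data/input"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)

This is easy to write, but every call pays the start-up cost of the hdfs client and the text output has to be parsed by hand, which is why a wrapper library such as pyhdfs scales better.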

Installing and configuring pyhdfs manually is quite troublesome. It is recommended to install pip first, and then install pyhdfs directly through pip:

sudo apt-get install python-pip  # install pip

pip install --upgrade pip  # optionally upgrade pip

pip install pyhdfs  # install the dependency library

To access the HDFS platform with pyhdfs, simply import pyhdfs at the beginning of the file.

2 Python calls HDFS
import pyhdfs

client = pyhdfs.HdfsClient(hosts="192.168.40.30:9000", user_name="janspiry")  # initialize the client (hosts are NameNode host:port pairs)

client = pyhdfs.HdfsClient(hosts="192.168.40.30:9000")  # or initialize without a user name

client.listdir("/data/input")  # list a directory

response = client.open("/data/input/README.txt")  # open a file on the platform
response.read()  # read its contents

client.copy_from_local("/home/janspiry/spark/tmp.txt", "/data/input/tmp.txt")  # upload a local file to the platform

client.copy_to_local("/data/input/tmp.txt", "/home/janspiry/spark/tmp_copy.txt")  # copy a file from the platform to the local file system (illustrative local path)
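Continuing with the client object from above, pyhdfs can also create and delete files directly; a small sketch with illustrative paths:

client.create("/data/input/hello.txt", b"hello hdfs\n")  # write bytes to a new file
print(client.open("/data/input/hello.txt").read())       # read it back
client.delete("/data/input/hello.txt")                   # remove it again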

Hadoop platform

[Figure: Hadoop platform architecture]

HDFS and MapReduce are the two cores of Hadoop, used for distributed storage and parallel processing respectively. Since Hadoop 2.0, YARN, a resource management framework, has been added on top of HDFS. MapReduce can run on YARN, and so can other computing frameworks; YARN's main job is to manage resources such as CPU, disk, memory, and network.

Architecture understanding
HDFS

HDFS consists of a NameNode and DataNodes. The NameNode is the directory of the entire file system; it is kept in memory and stores the metadata of files, such as the file name, file size, creation time, and the locations of the file's blocks. The DataNodes store the data of the files, that is, the file contents themselves, split into small blocks.

The Hadoop Distributed File System (HDFS) is designed as a distributed file system that runs on commodity hardware. HDFS is highly fault-tolerant, provides high-throughput data access, and is very well suited to applications with large-scale data sets.

Data block (block): large files are divided into multiple blocks for storage; the default block size is 64 MB in Hadoop 1.x (128 MB since Hadoop 2.x). Each block is stored as multiple replicas on multiple DataNodes, 3 replicas by default (a small inspection sketch follows this list).
NameNode: the NameNode manages the file directory, the mapping from files to blocks, and the mapping from blocks to DataNodes.
DataNode: the DataNode is responsible for storage; most of the fault-tolerance mechanisms are implemented on the DataNodes.
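Continuing the pyhdfs example from earlier, the block size and replication factor of a stored file can be inspected through its WebHDFS FileStatus; a small sketch assuming the same host and path as before:

import pyhdfs

client = pyhdfs.HdfsClient(hosts="192.168.40.30:9000", user_name="janspiry")
status = client.get_file_status("/data/input/README.txt")
print(status)  # the FileStatus fields include blockSize, replication, and length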

MapReduce

MapReduce is a programming model for parallel computation over large-scale data sets (larger than 1 TB).

JobTracker responsibilities: accepting jobs, computing and allocating resources, and monitoring the data nodes.

MapReduce runs as a parallel computation over a large cluster and is abstracted into two functions: map and reduce. A MapReduce job splits the input data into several blocks, which are processed in parallel by map tasks. The framework first sorts the output of the maps, groups all values with the same key together, and feeds them to the reduce tasks; the reduce tasks then perform the corresponding parallel computation, and finally the results are written to HDFS for storage. A few points are worth remembering: the execution of a MapReduce job involves four independent entities (a word-count sketch in the Hadoop Streaming style follows this list):
(1) Client: writes the MapReduce program, configures the job, and submits the job; this is what the programmer does.
(2) JobTracker: initializes the job, assigns the job, communicates with the TaskTrackers, and coordinates the execution of the entire job.
(3) TaskTracker: keeps in contact with the JobTracker and executes Map or Reduce tasks on the assigned data splits. An important difference between the TaskTracker and the JobTracker is that there can be many TaskTrackers executing tasks but only one JobTracker; like the NameNode in HDFS, the JobTracker is a single point of failure, a problem discussed in the YARN section below.
(4) HDFS: stores the job data and configuration information; the final results are also stored on HDFS.
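To make the map, shuffle, and reduce steps above concrete, here is a word-count sketch in the Hadoop Streaming style. In a real Streaming job the mapper and reducer would normally be two separate scripts passed to the hadoop-streaming jar; the file name wc.py and the local pipeline in the comment are assumptions.

#!/usr/bin/env python
# Word count, Hadoop Streaming style: the mapper emits (word, 1) pairs on stdout,
# the framework sorts them by key, and the reducer sums the counts for each word.
import sys

def mapper(lines):
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(lines):
    current, total = None, 0
    for line in lines:                                # input is sorted by word
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")          # flush the previous word
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")                  # flush the last word

if __name__ == "__main__":
    # simulate the framework locally: ./wc.py map < input.txt | sort | ./wc.py reduce
    mapper(sys.stdin) if sys.argv[1:] == ["map"] else reducer(sys.stdin)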

Yarn

[Figure: YARN architecture]

(1) ResourceManager
accepts the distributed computing programs submitted by users and divides resources among them;
manages and monitors the resources on each NodeManager to facilitate load balancing.
(2) NodeManager
manages the computing resources (CPU and memory) of the machine it runs on;
accepts the tasks assigned by the ResourceManager, creates containers, and reclaims resources.

Why YARN exists
  1. Single-point problems of the JobTracker and NameNode: the JobTracker carries a heavy workload and is prone to failure.

  2. MapReduce uses a slot-based resource allocation model. A slot is a coarse-grained unit of resource division; a task usually does not use up the resources of its slot, yet other tasks cannot use the idle portion, and map slots and reduce slots are not interchangeable.

After the YARN refactoring, the functions of the JobTracker were distributed among several processes, including the ResourceManager and the NodeManagers: things related to the computing model can be placed in an auxiliary service of the NodeManager, such as MapReduce's shuffle, while the monitoring function is split between the NodeManager and the ApplicationMaster. Because these processes can be deployed separately, single points of failure and pressure are greatly reduced.

In addition, YARN uses Containers, whereas Hadoop 1.x used slots. The details are as follows:

The JobTracker is split into three parts:

  1. ResourceManager, responsible for the Scheduler and the ApplicationsManager (application management);

    For the Scheduler, YARN provides several ready-to-use schedulers, such as the Fair Scheduler and the Capacity Scheduler. The scheduler allocates resources purely according to the resource requirements of each application; the basic unit of allocation is the Container, which encapsulates memory, CPU, network, and disk together. Users can also implement their own scheduler.

  2. ApplicationMaster, responsible for job life-cycle management;

  3. JobHistoryServer, responsible for displaying the logs.

TaskTracker is replaced by NodeManager.

The NodeManager manages the containers. The Container is where the real work happens: it encapsulates the multi-dimensional resources of a node, such as memory, CPU, disk, and network. When the AM (ApplicationMaster) applies to the RM (ResourceManager) for resources, the resources the RM returns to the AM are expressed as Containers. YARN assigns a Container to each task, and the task can only use the resources described in its Container. Containers are dynamically partitioned resources.
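In Spark on YARN, each executor runs inside one such Container, so the executor sizes requested at submission time are what the ResourceManager packs into Containers. A small PySpark sketch with illustrative values, assuming HADOOP_CONF_DIR points at the cluster configuration:

from pyspark import SparkConf, SparkContext

# Each executor below is requested from YARN as one Container of roughly 512 MB / 1 core
conf = (SparkConf()
        .setAppName("container-sizing-demo")
        .setMaster("yarn")
        .set("spark.executor.memory", "512m")
        .set("spark.executor.cores", "1")
        .set("spark.executor.instances", "2"))
sc = SparkContext(conf=conf)
print(sc.parallelize(range(1000)).sum())  # a trivial job to exercise the executors
sc.stop()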

Comparison of Spark and Hadoop
  1. Spark can process data in memory, so the intermediate output of a job can be kept in memory instead of being read from and written to HDFS.

    MapReduce's design: intermediate results are stored in files, which improves reliability and reduces memory usage, but sacrifices performance.
    Spark's design: data is exchanged in memory, which is faster, although memory is less reliable than disk; overall performance is therefore better than MapReduce.

  2. Spark has a DAG (directed acyclic graph).

    The fundamental reason Spark computes faster than MapReduce is the DAG computing model. In most cases a DAG reduces the number of shuffles compared with MapReduce; Spark's DAGScheduler is essentially an improved version of MapReduce. If a computation does not involve exchanging data with other nodes, Spark completes it in memory in one pass, so intermediate results do not have to be written to disk, which reduces disk I/O. If the computation does involve data exchange, however, Spark also writes the shuffle data to disk. A common misconception is that Spark is fast simply because it computes in memory; that is not the main reason, since data must be loaded into memory before it can be processed, and the same is true for Hadoop. The key point is that Spark can cache data that is used repeatedly, cutting the time spent loading it, which is why Spark is better at machine-learning algorithms that iterate over the same data (see the caching sketch after this list). Even Spark's disk-based computation is faster than Hadoop's.

  3. The number of replicas of data stored by Spark can be specified; MR defaults to 3.

    Spark is highly fault-tolerant: it achieves efficient fault tolerance through the resilient distributed dataset (RDD). An RDD is a distributed, read-only dataset held in the memory of the nodes; these datasets are resilient, and if part of the data is lost or corrupted it can be rebuilt through the lineage of the computation over the whole dataset. With MapReduce, fault tolerance may only be achievable by recomputation, which is costly.

  4. JVM optimization

    Hadoop starts a new JVM for every MapReduce task, which is a process-based operation, while Spark's operations are thread-based: the JVM is started only once, when the Executor starts, and tasks reuse threads inside it. Starting a JVM can take several seconds or even more than ten seconds each time, so when there are many tasks Hadoop is slower than Spark.

  5. Spark provides operators for all kinds of scenarios; MR has only map and reduce, which correspond to just the map and reduceByKey operators in Spark.

  6. Spark uses coarse-grained resource application, so the Application executes quickly.

    Spark applies for resources at a coarse granularity: when a Spark application is submitted, it applies for all of its resources up front. If the resources are not granted it waits; once they are granted, the application runs and individual tasks do not need to apply for resources themselves, so tasks execute quickly. The resources are released only after the last task finishes. The advantage is fast execution; the disadvantage is that the cluster cannot be fully utilized.

    MapReduce applies for resources at a fine granularity: after an application is submitted, each task applies for its own resources when it runs and releases them as soon as it finishes, so task execution is slower and the application as a whole runs relatively slowly.

    The advantage is that cluster resources are fully utilized; the disadvantage is that application execution is relatively slow.

  7. Map-side aggregation during shuffle is automatic in Spark, whereas in MR it has to be configured manually.

  8. Spark's shuffle bypass mechanism has its own flexible implementation.
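As a small illustration of point 2 above, the sketch below caches an RDD that an iterative computation reuses, so the input file is read and parsed only once; the points.txt path and the toy computation are assumptions.

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("cache-demo").setMaster("local[*]"))

# Parse the input once and keep it in memory for the following iterations
points = (sc.textFile("hdfs://master:9000/data/input/points.txt")
            .map(lambda line: [float(x) for x in line.split(",")])
            .cache())

print(points.count())                # the first action materializes and caches the RDD
for i in range(10):                  # later iterations read from the in-memory cache
    total = points.map(lambda p: sum(p)).reduce(lambda a, b: a + b)
    print(i, total)

sc.stop()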

In general, Spark solves the following problems of Hadoop

  • The level of abstraction is low; everything has to be written in code by hand, which makes it hard to use => With the RDD abstraction, the code for the actual data-processing logic is very short.

  • Only two operations, Map and Reduce, are provided, which limits expressiveness => Many transformations and actions are provided; common operations such as Join and GroupBy are already implemented as RDD transformations and actions.

  • A Job has only two phases, Map and Reduce; complex computations need a large number of Jobs, and the dependencies between Jobs have to be managed by the developers themselves => A single Job can contain multiple RDD transformations and generate multiple stages at once, and if the RDD partitioning is unchanged across several map operations, they can be placed in the same task.

  • The processing logic is hidden in code details, with no overall view => In Scala, through anonymous functions and higher-order functions, RDD transformations support a fluent API that gives an overall view of the processing logic; the code does not contain the implementation details of the operations, so the logic is clearer.

  • Intermediate results are also stored on HDFS => Intermediate results are kept in memory; if they do not fit, they are written to local disk rather than HDFS.

  • A ReduceTask can only start after all MapTasks have finished => Transformations on the same partition form a pipeline that runs within one task; different partitions require a shuffle and are divided into different stages, and a stage can only start after the previous stage has finished.

  • Latency is high and only batch processing is supported; support for interactive and real-time data processing is insufficient => Discretized Streams are provided, which process stream data by splitting the stream into small batches (see the sketch after this list).

  • Performance is poor for iterative data processing => Iterative computation is sped up by caching data in memory.
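A minimal sketch of the Discretized Stream idea in PySpark (the Spark 2.x streaming API); the socket source on localhost:9999 and the 5-second batch interval are assumptions.

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(conf=SparkConf().setAppName("dstream-demo").setMaster("local[2]"))
ssc = StreamingContext(sc, batchDuration=5)       # cut the stream into 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)   # text stream from a socket source
counts = (lines.flatMap(lambda l: l.split(" "))
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))  # word count per micro-batch
counts.pprint()                                   # print the first results of each batch

ssc.start()
ssc.awaitTermination()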

Summary

Relationship between Hadoop and YARN
1. The Trackers (JobTracker/TaskTracker) are replaced by Managers (ResourceManager/NodeManager).
2. YARN uses Containers, while Hadoop 1.x used slots.


Comparison of Hadoop and Spark
1. Spark processes data in memory; the intermediate output of a job can be kept in memory, eliminating the need to read from and write to HDFS.
2. Spark has a DAG (directed acyclic graph).
3. The number of replicas of data stored by Spark can be specified; MR defaults to 3.
4. JVM optimization: MR starts a JVM for every task, a process-based operation, while Spark's operations are thread-based; the JVM is started only once when the Executor starts, and tasks reuse threads.
5. Spark provides operators for all kinds of scenarios; MR has only map and reduce, which correspond to the map and reduceByKey operators in Spark.
6. Spark uses coarse-grained resource application for executing an Application.
7. Spark's shuffle mechanism has its own flexible implementation.

