Spark study notes - Chapter 4: Basic principles of Spark programming - 4.2 SparkContext, the programming entry point

  1. SparkContext is the programming entry point of PySpark: jobs are submitted, tasks are distributed, and applications are registered through it. A SparkContext instance represents a connection to Spark, and only after this connection is established can jobs be submitted to the cluster. Once a SparkContext has been instantiated, RDDs and Broadcast variables can be created.

  2. Obtaining a SparkContext. After starting pyspark --master spark://hadoop-maste:7077, the SparkContext object can be obtained from the SparkSession. The startup log shows that the SparkContext is connected to the Spark master at spark://hadoop-maste:7077.
    Another way to obtain a SparkContext is to import pyspark.SparkContext and create it yourself. Create a new sparkContext.py file with the following content:

    from pyspark import SparkContext
    from pyspark import SparkConf

    conf = SparkConf()
    conf.set('spark.master', 'local')            # 'spark.master' is the configuration key for the master URL
    sparkContext = SparkContext(conf=conf)
    rdd = sparkContext.parallelize(range(100))   # build an RDD from a Python range
    print(rdd.collect())
    sparkContext.stop()
    

    Before running it, edit the log4j.properties file in Spark's conf configuration directory and raise the log level, so that the console is not flooded with background log output, then restart the Spark cluster.
    Run spark-submit sparkContext.py.
    The SparkConf object in the code above holds Spark's configuration parameters; it is explained in detail later.

  3. accumulator is the SparkContext method used to create an Accumulator. An accumulator can be added to from every task, but it only supports the add operation. The method also accepts an initial value for the accumulator. As an example, we use an Accumulator to sum the integers 0 through 100.
    Create a new accumulator.py file with the following content:

    from pyspark import SparkContext, SparkConf
    import numpy as np

    conf = SparkConf()
    conf.set('spark.master', 'spark://hadoop-maste:7077')
    context = SparkContext(conf=conf)
    acc = context.accumulator(0)                   # accumulator with initial value 0
    print(type(acc), acc.value)
    rdd = context.parallelize(np.arange(101), 5)   # the integers 0..100 in 5 partitions

    def acc_add(a):
        acc.add(a)                                 # each task adds its element to the accumulator
        return a

    rdd2 = rdd.map(acc_add)
    print(rdd2.collect())
    print(acc.value)                               # 5050, the sum of 0..100
    context.stop()
    

    Run it with spark-submit accumulator.py.

  4. addFile adds a file to the job; the file is later retrieved with SparkFiles.get.
    The method takes a path and uploads the file at that path to every node of the cluster so it can be read while the job runs. The path may be a local path, an HDFS path, or an http, https, or ftp URI. To upload a folder, set the recursive parameter to True. Uploaded files are retrieved with SparkFiles.get(filename).

    If a local path is given, every node is expected to have the file at that same path, since each node looks it up in its own local directory.

    Create a new addFile.py file with the following content:

    from pyspark import SparkFiles
    import os
    import numpy as np
    from pyspark import SparkContext
    from pyspark import SparkConf

    # write a local file containing the value 100
    tempdir = '/root/workspace/'
    path = os.path.join(tempdir, 'num_data')
    with open(path, 'w') as f:
        f.write('100')

    conf = SparkConf()
    conf.set('spark.master', 'spark://hadoop-maste:7077')
    context = SparkContext(conf=conf)
    context.addFile(path)                      # distribute the file to every node
    rdd = context.parallelize(np.arange(10))

    def fun(iterable):
        # read the distributed file on the executor and multiply each element by its value
        with open(SparkFiles.get('num_data')) as f:
            value = int(f.readline())
            return [x * value for x in iterable]

    print(rdd.mapPartitions(fun).collect())
    context.stop()
    

    Run spark-submit addFile.py.
    This example used a local file; next we try an HDFS path.
    Create a new hdfs_addFile.py file with the following content:

    from pyspark import SparkFiles
    import numpy as np
    from pyspark import SparkContext
    from pyspark import SparkConf

    conf = SparkConf()
    conf.set('spark.master', 'spark://hadoop-maste:7077')
    context = SparkContext(conf=conf)
    path = 'hdfs://hadoop-maste:9000/datas/num_data'
    context.addFile(path)                      # this time the file comes from HDFS
    rdd = context.parallelize(np.arange(10))

    def fun(iterable):
        with open(SparkFiles.get('num_data')) as f:
            value = int(f.readline())
            return [x * value for x in iterable]

    print(rdd.mapPartitions(fun).collect())
    context.stop()
    
    # list the HDFS root directory
    hdfs dfs -ls /
    # create the /datas directory under the root
    hdfs dfs -mkdir /datas
    # put the local num_data file into the HDFS directory /datas/
    hdfs dfs -put num_data /datas/num_data
    # check that the copy succeeded
    hdfs dfs -ls /datas
    # print the contents of /datas/num_data on HDFS
    hdfs dfs -cat /datas/num_data
    

    Run spark-submit hdfs_addFile.py.

    Note that addFile treats the path as a local path by default; for an HDFS path you must include the hdfs://hadoop-maste:9000 protocol, host, and port information in the URI.
    Next, let's read a file over the network. On the machine 182.150.37.49 I installed an httpd server; for installing and configuring httpd, refer to my notes here: [link]
    On the machine where httpd is installed, go to the /var/www/html/ directory and create a new num_data file there whose content is 100.
    Then write an http_addFile.py file that reads the num_data file just created on the httpd server.
    Its content is as follows:

    from pyspark import SparkFiles
    import numpy as np
    from pyspark import SparkContext
    from pyspark import SparkConf

    conf = SparkConf()
    conf.set('spark.master', 'spark://hadoop-maste:7077')
    context = SparkContext(conf=conf)
    path = 'http://192.168.0.6:808/num_data'
    context.addFile(path)                      # the file is downloaded over http
    rdd = context.parallelize(np.arange(10))

    def fun(iterable):
        with open(SparkFiles.get('num_data')) as f:
            value = int(f.readline())
            return [x * value for x in iterable]

    print(rdd.mapPartitions(fun).collect())
    context.stop()

    Run spark-submit http_addFile.py.
    The three examples above show the power of the addFile method: with it, the tasks of a running PySpark job can read files from virtually any location and use them in the computation.
  5. applicationId returns the application id registered with the cluster.

    from pyspark import SparkContext, SparkConf
    import numpy as np

    conf = SparkConf()
    conf.set('spark.master', 'spark://hadoop-maste:7077')
    context = SparkContext(conf=conf)
    rdd = context.parallelize(np.arange(10))
    print('applicationId:', context.applicationId)
    print(rdd.collect())
    context.stop()
  6. binaryFiles reads binary files.
    This method is used to read binary files such as audio, video, and images. It returns a tuple for each file: the first element is the path of the file and the second is the binary content of the file.
    I uploaded two images to the HDFS directory /datas/pics/ and use binaryFiles to read the binary image data under /datas/pics/.
    Create a new binaryFiles.py file with the following content:
    from pyspark import SparkContext, SparkConf

    conf = SparkConf()
    conf.set('spark.master', 'spark://hadoop-maste:7077')
    context = SparkContext(conf=conf)
    rdd = context.binaryFiles('/datas/pics/')
    print('applicationId:', context.applicationId)
    result = rdd.collect()
    for data in result:
        print(data[0], data[1][:10])    # file path and the first 10 bytes of its content
    context.stop()

    Run spark-submit binaryFiles.py.
    This method makes it very easy to read binary data files, especially when working with images, audio, and video.

  7. broadcast creates broadcast variables.
    The broadcast method on SparkContext creates a broadcast variable; broadcasting is recommended for shared variables larger than about 5 MB. The broadcast mechanism minimizes network IO and thereby improves performance. In the following example, the string ' hello ' is broadcast; every task receives the broadcast variable and returns it concatenated to its element. Create a new broadcast.py file with the following content:
    from pyspark import SparkContext, SparkConf
    import numpy as np

    conf = SparkConf()
    conf.set('spark.master', 'spark://hadoop-maste:7077')
    context = SparkContext(conf=conf)
    broad = context.broadcast(' hello ')    # broadcast the string to every executor
    rdd = context.parallelize(np.arange(27), 3)
    print('applicationId:', context.applicationId)
    print(rdd.map(lambda x: str(x) + broad.value).collect())
    context.stop()

    Run spark-submit broadcast.py.
    The result shows that every distributed task received the broadcast variable ' hello '.

  8. defaultMinPartitions returns the default minimum number of partitions.
    from pyspark import SparkContext, SparkConf

    conf = SparkConf()
    conf.set('spark.master', 'spark://hadoop-maste:7077')
    context = SparkContext(conf=conf)
    print('defaultMinPartitions:', context.defaultMinPartitions)
    context.stop()
  9. emptyRDD creates an empty RDD; the RDD has no partitions and contains no data.

    # sc is the SparkContext available in the pyspark shell
    rdd = sc.emptyRDD()
    rdd.collect()              # returns []
    rdd.getNumPartitions()     # returns 0
    


  10. getConf() returns the configuration object of the job.
    sc.getConf().toDebugString()

  11. getLocalProperty and setLocalProperty get and set a property in the local thread. A property set with setLocalProperty only applies to jobs submitted from the current thread, not to other jobs; properties are key/value pairs of strings, as in the sketch below.
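    A minimal sketch of using the pair of methods (the key 'spark.scheduler.pool' is only an illustrative choice; any string key and value work):

    from pyspark import SparkContext, SparkConf

    conf = SparkConf()
    conf.set('spark.master', 'local')
    sc = SparkContext(conf=conf)
    # visible only to jobs submitted from this thread
    sc.setLocalProperty('spark.scheduler.pool', 'pool_a')
    print(sc.getLocalProperty('spark.scheduler.pool'))   # 'pool_a'
    print(sc.getLocalProperty('never.set.key'))          # None for a key that was never set
    sc.stop()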

  12. setLogLevel sets the log level; this setting overrides any user-defined log level. Valid values are: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN. Comparing the output at different levels shows that different levels print different amounts of logging, so an appropriate level can be chosen for debugging, as in the sketch below.
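    A minimal sketch (WARN is only an example; any of the level strings listed above can be passed):

    from pyspark import SparkContext, SparkConf

    conf = SparkConf()
    conf.set('spark.master', 'local')
    sc = SparkContext(conf=conf)
    sc.setLogLevel('WARN')    # suppress INFO output; only WARN and above are printed
    print(sc.parallelize(range(5)).collect())
    sc.stop()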

  13. getOrCreate gets or creates a SparkContext object; the SparkContext created by this method is a singleton. The method can accept a SparkConf object.
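    A minimal sketch, assuming the classmethod form SparkContext.getOrCreate(conf):

    from pyspark import SparkContext, SparkConf

    conf = SparkConf()
    conf.set('spark.master', 'local')
    sc1 = SparkContext.getOrCreate(conf)
    sc2 = SparkContext.getOrCreate()    # returns the same singleton instance
    print(sc1 is sc2)                   # True
    sc1.stop()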

  14. hadoopFile reads a file on HDFS through Hadoop's 'old' MapReduce input-format interface.
    The first parameter is the file path, the second is the input format class, the third is the key class, and the fourth is the value class:

    sc.hadoopFile("/datas/num_data", inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
                  keyClass="org.apache.hadoop.io.Text", valueClass="org.apache.hadoop.io.LongWritable").collect()

    Note that by default the position (offset) of each line is read out as the key.

  15. textFile and saveAsTextFile read and write text files on HDFS.
    This is the simplest way to read text files stored on HDFS, as in the sketch below.

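    A minimal sketch (the HDFS path /datas/texts/nums is only an illustrative location, assuming the cluster used above):

    from pyspark import SparkContext, SparkConf

    conf = SparkConf()
    conf.set('spark.master', 'spark://hadoop-maste:7077')
    sc = SparkContext(conf=conf)
    # write an RDD of strings out as text files, then read them back
    sc.parallelize([str(x) for x in range(10)], 2).saveAsTextFile('/datas/texts/nums')
    rdd = sc.textFile('/datas/texts/nums')
    print(rdd.collect())    # each line comes back as a string
    sc.stop()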

  16. parallelize creates an RDD from a Python collection; range can be used, and of course numpy's arange works as well, as in the sketch below.
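    A minimal sketch showing both a Python range and a numpy arange as input:

    from pyspark import SparkContext, SparkConf
    import numpy as np

    conf = SparkConf()
    conf.set('spark.master', 'local')
    sc = SparkContext(conf=conf)
    rdd1 = sc.parallelize(range(10), 3)       # from a Python range, in 3 partitions
    rdd2 = sc.parallelize(np.arange(10.0))    # from a numpy array
    print(rdd1.collect(), rdd2.collect())
    sc.stop()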

  17. saveAsPickleFile and pickleFile save an RDD to, and read it back from, compressed files in Python's pickle format.

    sc.parallelize(range(100),3).saveAsPickleFile('/datas/pickles/bbb', 5)
    sorted(sc.pickleFile('/datas/pickles/bbb', 3).collect())
    

    The sorted built-in also takes the keyword argument reverse; set it to True to sort in descending order.
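    For example, reading back the same /datas/pickles/bbb path as above and sorting in descending order:

    sorted(sc.pickleFile('/datas/pickles/bbb', 3).collect(), reverse=True)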

  18. range(start, end=None, step=1, numSlices=None) creates an RDD from the given start value, end value, and step.
    numSlices specifies the number of partitions.

    rdd = sc.range(1,100,11,3)
    rdd.collect()
    dir(rdd)
    


  19. runJob(rdd, partitionFunc, partitions=None, allowLocal=False) runs the given function on the specified partitions.
    partitions is a list specifying the partition indexes; if it is not given, partitionFunc runs on all partitions by default.

    rdd = sc.range(1,1000,11,10)
    sc.runJob(rdd,lambda x:[a for a in x])
    

    Run the squaring function a ** 2 only on partitions 0, 1, 4 and 6:

    rdd = sc.range(1,1000,11,10)
    sc.runJob(rdd,lambda x:[a**2 for a in x],partitions=[0,1,4,6])
    


  20. setCheckpointDir(dirName) sets the checkpoint directory; checkpoints are used for recovery when an exception occurs, and the directory must be an HDFS directory.
    Set the checkpoint directory to /datas/checkpoint/:

    sc.setCheckpointDir('/datas/checkpoint/')
    rdd = sc.range(1,1000,11,10)
    rdd.checkpoint()
    rdd.collect()
    

    After the run completes, look at the /datas/checkpoint directory on HDFS.
    You will find that data was saved during the run; if an exception occurs while the Spark program is running, the checkpoint is used to recover from it.

  21. sparkUser returns the user name running the current job.
    sc.sparkUser()

  22. startTime returns the start time of the job.
    sc.startTime
    It returns a Long value in epoch milliseconds, which can be converted to a readable time, for example as shown below.
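    A quick conversion in the pyspark shell using Python's standard datetime module:

    import datetime
    datetime.datetime.fromtimestamp(sc.startTime / 1000)   # epoch milliseconds -> local datetime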

  23. statusTracker() returns a StatusTracker object, through which you can get the ids of active jobs and active stages, as well as job information and stage information. You can use this object to monitor the status of running jobs in real time; see the sketch below.

    t = sc.statusTracker()
    dir(t)
    

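    A minimal sketch of querying the tracker; getActiveJobsIds and getActiveStageIds are the StatusTracker methods assumed here:

    from pyspark import SparkContext, SparkConf

    conf = SparkConf()
    conf.set('spark.master', 'local')
    sc = SparkContext(conf=conf)
    sc.parallelize(range(100), 4).count()    # run a job so the tracker has something to report
    tracker = sc.statusTracker()
    print(tracker.getActiveJobsIds())        # ids of jobs still running (likely empty after count returns)
    print(tracker.getActiveStageIds())       # ids of stages still running
    sc.stop()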

  24. stop() stops the connection between the SparkContext and the cluster. It should generally be the last line of a program, to make sure the connection to the cluster is closed after the job finishes.
    sc.stop()

  25. uiWebUrl returns the URL of the web UI.
    sc.uiWebUrl

  26. union(rdds) merges multiple RDDs into a single RDD.

    rdd1 = sc.parallelize(range(5),4)
    rdd2 = sc.parallelize(range(10),3)
    rdd3 = sc.union([rdd1,rdd2])
    rdd3.collect()
    

    Get the number of partitions of the merged RDD:
    rdd3.getNumPartitions()

  27. version returns the Spark version number.
    sc.version
