Spark big data technology and application: final exam summary questions

  1. PySpark can be started in Local, YARN, Standalone, or Mesos mode.

  2. Controlling the log level. The valid log levels are: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN. There are two ways to control the log output: set log4j.rootCategory=INFO, console in log4j.properties, or set the level in code:

      from pyspark import SparkContext
      sc = SparkContext("local", "First App")
      sc.setLogLevel("WARN")

  3. What is an RDD (Resilient Distributed Dataset)? An RDD is a read-only collection of partitioned records. An RDD can only be created through deterministic operations on data in stable physical storage or on other existing RDDs.

  4. Shared variables: accumulators and broadcast variables.

  5. How to understand the concept of lineage in a Spark RDD. RDDs only support coarse-grained transformations, i.e. a single operation applied to a large number of records. The lineage of an RDD records its metadata and the transformations that produced it. When partition data inside an RDD is lost, the lost partitions can be recomputed and restored from this information.

  6. The dependencies between RDDs are divided into narrow dependencies (e.g. map, filter, union) and wide dependencies (e.g. groupByKey, reduceByKey, sortByKey).

  7. Persistence of RDDs. Persistence methods: persist() or cache(). Persistent storage levels: MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, DISK_ONLY, DISK_ONLY_2.

  8. Please write the four main ways of creating a Spark RDD. 1) From a collection; Spark mainly provides two methods, parallelize and makeRDD. 2) From files in external storage. 3) From other RDDs. 4) Directly (new). Further ways include: 5) from NoSQL stores such as HBase; 6) from S3; 7) from a data stream, such as a socket. (A combined sketch of items 4, 7 and 8 follows item 12 below.)

  9. Briefly describe the ways Spark Streaming obtains data and the methods it uses. Socket: socketTextStream(); HDFS: textFileStream(); Kafka: pyspark.streaming.kafka.KafkaUtils; Flume: pyspark.streaming.flume.FlumeUtils.

  10. Briefly describe the difference and connection between an RDD and a Spark SQL DataFrame. Difference: an RDD is a distributed collection of objects whose internal structure is opaque to Spark, while a DataFrame is an RDD-based distributed dataset that carries detailed structural information and is equivalent to a table in a relational database. Connection: 1) both are distributed resilient datasets on the Spark platform, which makes processing very large data convenient; 2) both have a lazy mechanism: transformations such as map are not executed immediately, and execution only happens when an action is encountered; 3) both automatically cache computations according to Spark's memory, so even with large amounts of data there is no need to worry about memory overflow; 4) both have the concept of partitions; 5) both share many common operations, such as filter and sorting.

  11. Briefly describe the working principle of Spark Streaming. Spark Streaming receives real-time input data streams, divides the data into batches, processes each batch with the Spark engine, and produces the result stream in batches.

  12. What are the running modes of Spark? Briefly describe each. Local mode: Spark runs on a single machine, generally used for development and testing. Standalone mode: a Spark cluster composed of a Master and Slaves is built, and Spark runs on that cluster. Spark on YARN mode: the Spark client connects directly to YARN, without building a separate Spark cluster. Spark on Mesos mode: the Spark client connects directly to Mesos, without building a separate Spark cluster.
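  As a combined illustration of items 4, 7 and 8 above, the following is a minimal PySpark sketch to be run as a standalone script; the sample values and application name are illustrative assumptions, not part of the original questions:

      from pyspark import SparkContext, StorageLevel

      sc = SparkContext("local", "RddBasicsSketch")

      # Item 8: create an RDD from a collection (parallelize); reading a file with
      # sc.textFile(...) would be the external-storage variant.
      nums = sc.parallelize([1, 2, 3, 4, 5], 2)

      # Item 7: persist a derived RDD with an explicit storage level.
      squares = nums.map(lambda x: x * x)
      squares.persist(StorageLevel.MEMORY_AND_DISK)

      # Item 4: shared variables - a broadcast variable and an accumulator.
      factor = sc.broadcast(10)
      total = sc.accumulator(0)

      def add_to_total(x):
          total.add(x)

      squares.foreach(add_to_total)                              # accumulator updated on the executors
      print(squares.map(lambda x: x * factor.value).collect())   # [10, 40, 90, 160, 250]
      print(total.value)                                         # 55
      sc.stop()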
  13. Characteristics of streaming data: it arrives in real time, its order is independent of the application, it is large in scale, and it is difficult to extract again once processed.

  14. Basic steps of Spark Streaming: create a DStream from the data source, define the processing logic, start the processing, wait for the processing result, and end the computation. (A combined sketch of items 9, 11 and 14 to 16 follows this list.)

  15. Spark Streaming data loading steps: initialize the SparkContext, create a StreamingContext object, and create an input DStream.

  16. DStream persistence (output) operations: 1) print() (note: use pprint() in Python); 2) saveAsTextFiles(prefix, [suffix]); 3) saveAsObjectFiles (not available in Python); 4) saveAsHadoopFiles(prefix, [suffix]); 5) foreachRDD(func).

  17. (1) Read the data as an RDD:

      lines = sc.textFile("/home/ubuntu/data/blogInfo.txt")

      (2) Count how many different user IDs there are:

      data = lines.flatMap(lambda x: x.split("\t"))
      result = data.distinct().count()

      (3) Count the number of followers of each user:

      data3 = lines.map(lambda x: x.split("\t")[1])
      result3 = data3.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
      result3.collect()

      (4) Write the result of step (3) to an HDFS file (specific path: hdfs://linux01:9000/out/result.txt):

      result3.saveAsTextFile("hdfs://linux01:9000/out/result.txt")

  18. (1) Read the data as a DataFrame with the column names "time", "name" and "num":

      data = spark.read.load("/home/ubuntu/data/log.txt", format="csv", sep="\t", header=True)

      (2) Register data as a SQL table named "log":

      data.registerTempTable("log")

      (3) Count the total visits to the same web page on the same day:

      spark.sql("select time, name, sum(num) from log group by time, name").show()

      (4) Save the data in parquet format, still in the /home/ubuntu/data/ directory:

      employee.write.save("/home/ubuntu/data/employee.parquet")

  20. Spark SQL execution includes: Operation, Data Source, Result, Optimize.

  21. The high-level APIs provided by Spark MLlib include: ML Algorithms, Featurization, Pipelines, Utilities.

  22. Features of Spark: fast, general-purpose, and supports multiple resource managers.

  23. Spark and Hadoop handle many of the same tasks, but they differ in the following two respects. (1) They solve the problem in different ways: Hadoop is a distributed data infrastructure, while Spark is a tool for processing big data held in distributed storage and does not itself store distributed data. (2) The two can be used together or separately: Hadoop provides not only the distributed storage of HDFS but also MapReduce processing, and Spark can also work with file systems other than HDFS.

  24. Advantages of Spark over DSM: batch operations on an RDD schedule tasks according to where the data is stored; and for scan-type operations, when memory is insufficient to cache the entire RDD, only part of it is cached, which avoids memory overflow.

  25. Features of an RDD: it is composed of partitions; each partition has its own compute function; RDDs depend on one another; the number of partitions can be controlled; and the storage locations of the blocks are kept as a list.
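  Items 9, 11 and 14 to 16 describe the Spark Streaming (DStream) workflow; the sketch below ties them together as a socket word count. It is a minimal sketch, assuming a text server on localhost:9999 and a 5-second batch interval (both are illustrative assumptions):

      from pyspark import SparkContext
      from pyspark.streaming import StreamingContext

      # Initialize SparkContext and StreamingContext (items 14 and 15); "local[2]" leaves
      # one thread for the receiver and one for processing.
      sc = SparkContext("local[2]", "StreamingWordCountSketch")
      ssc = StreamingContext(sc, 5)

      # Create the input DStream from a socket source (item 9: socketTextStream()).
      lines = ssc.socketTextStream("localhost", 9999)

      # Custom processing logic: split each line into words and count them per batch (item 11).
      counts = (lines.flatMap(lambda line: line.split(" "))
                     .map(lambda word: (word, 1))
                     .reduceByKey(lambda a, b: a + b))

      counts.pprint()          # item 16: print() is pprint() in Python

      ssc.start()              # start the processing
      ssc.awaitTermination()   # wait for the processing to finish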
  2. Please list 7 action operators of Spark RDD and briefly describe their functions.

    answer:

    reduce(f): Aggregates the elements in the RDD through the specified aggregation method.

    collect(): Returns a list containing all elements of the RDD.

    count(): Counts the number of elements in the RDD.

    take(n): Returns the first n elements of the RDD as a list.

    first(): Returns the first element of the RDD; the return type is the element type.

    top(n): Returns the n largest elements of the RDD as a list.

    saveAsTextFile(path): Saves the elements of the RDD to the file system as strings.

    foreach(f): Traverses the RDD and applies the custom processing function f to each element.

    foreachPartition(f): Traverses the RDD partition by partition, applying the function f to each partition.
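    A minimal standalone PySpark sketch that exercises the action operators listed above (the sample numbers and the commented output path are illustrative assumptions):

      from pyspark import SparkContext

      sc = SparkContext("local", "ActionOperatorsSketch")
      rdd = sc.parallelize([3, 1, 4, 1, 5, 9, 2, 6])

      print(rdd.reduce(lambda a, b: a + b))    # reduce: aggregate all elements -> 31
      print(rdd.collect())                     # collect: every element as a list
      print(rdd.count())                       # count: number of elements -> 8
      print(rdd.take(3))                       # take: the first 3 elements -> [3, 1, 4]
      print(rdd.first())                       # first: the first element -> 3
      print(rdd.top(2))                        # top: the 2 largest elements -> [9, 6]
      rdd.foreach(lambda x: None)              # foreach: apply a function to each element on the executors
      # rdd.saveAsTextFile("/tmp/action_demo") # saveAsTextFile: the output directory must not already exist
      sc.stop()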

  3. List seven transformation operators of Spark RDD and briefly describe their functions.

    answer:

    map: Maps each element of the RDD one by one; the mapping can change the element's type or its value.

    flatMap: First performs a map operation on every element of the RDD, then flattens the results.

    filter: Filters the elements of the RDD according to the specified condition.

    union: Takes the union of two RDDs and returns a new RDD.

    intersection: Takes the intersection of two RDDs and returns a new RDD that contains no duplicate elements.

    sortBy: Sorts the elements of the RDD by a specified key.

    mapPartitions: Performs the map operation on each partition of the RDD as a whole.
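    A minimal standalone PySpark sketch that exercises the transformation operators listed above (the sample numbers are illustrative assumptions; collect() is only used to show the results):

      from pyspark import SparkContext

      sc = SparkContext("local", "TransformationOperatorsSketch")
      a = sc.parallelize([1, 2, 3, 4, 5])
      b = sc.parallelize([4, 5, 6, 7])

      print(a.map(lambda x: x * 10).collect())                 # map -> [10, 20, 30, 40, 50]
      print(a.flatMap(lambda x: [x, x]).collect())             # flatMap -> [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
      print(a.filter(lambda x: x % 2 == 0).collect())          # filter -> [2, 4]
      print(a.union(b).collect())                              # union -> [1, 2, 3, 4, 5, 4, 5, 6, 7]
      print(a.intersection(b).collect())                       # intersection -> 4 and 5, no duplicates (order not guaranteed)
      print(a.sortBy(lambda x: x, ascending=False).collect())  # sortBy -> [5, 4, 3, 2, 1]
      print(a.mapPartitions(lambda it: [sum(it)]).collect())   # mapPartitions: one partial sum per partition
      sc.stop()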


Origin blog.csdn.net/qq_56437391/article/details/125299224