Touge Big Data Assignment 7: Spark

Extracurricular homework seven: Spark

  • Assignment details

Content

1. Alibaba Cloud experiment "Offline Data Analysis Based on EMR" (Alibaba Cloud official experiment platform / Yunqi Lab). From Step 2: log in to the cluster and start the spark-shell, or install and configure Spark 3.1.3 in pseudo-distributed mode on your own virtual machine. Experimental requirements: refer to textbook section 10.5.2 and create a local file named <your name>.txt; in addition to other text, include several occurrences of your name in the file.

Count the number of lines that contain your name and screenshot the running results. Then perform word-frequency statistics on the <your name>.txt file and screenshot the results, as in the sketch below.
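A minimal spark-shell (Scala) sketch of the two counts; the file path and the string "zhangsan" are placeholders for your own name:

```scala
// Run inside spark-shell; `sc` is the pre-created SparkContext.
// "zhangsan" and the file path are placeholders -- substitute your own name.
val lines = sc.textFile("file:///home/hadoop/zhangsan.txt")

// 1) Number of lines containing the name string
val nameLineCount = lines.filter(line => line.contains("zhangsan")).count()
println(s"Lines containing the name: $nameLineCount")

// 2) Word-frequency statistics over the same file
val wordCounts = lines
  .flatMap(line => line.split(" "))   // split each line into words
  .map(word => (word, 1))             // pair each word with a count of 1
  .reduceByKey(_ + _)                 // sum the counts per word
wordCounts.collect().foreach(println)
```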

2. Huawei Cloud experiments (KooLabs official experiment platform): "Spark Statistical Analysis Experiment", "Spark-based Correlation Analysis", and "Spark SQL Data Analysis Experiment". Count the number of rows that contain your name and screenshot the running results.
3. Briefly answer the "Classroom Assessment" questions.

1. What is a Spark RDD? A Spark RDD (Resilient Distributed Dataset) is Spark's basic abstraction. It is a fault-tolerant, parallel data structure that can store and process large-scale datasets in a distributed manner across a cluster.
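As a small illustration (not part of the assignment text), an RDD can also be created directly from a local collection in spark-shell:

```scala
// Distribute a local Scala collection across the cluster as an RDD with 2 partitions.
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)
println(nums.count())   // action: returns 5
```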

2. What data is imported into the RDD, and where is it imported from? Data is loaded into an RDD with SparkContext's textFile() method; it can be imported from the local file system or from the Hadoop Distributed File System (HDFS).
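For example, assuming a local file path such as the one below, the textFile() call looks like this:

```scala
// Load a local file into an RDD of lines (path is a placeholder).
val localRdd = sc.textFile("file:///home/hadoop/data/input.txt")
```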

3. How should the program be modified when importing data from HDFS? Simply change the file path to an HDFS URI, such as hdfs://namenode:port/path/to/file.
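The same program with only the path changed to an HDFS URI; the host, port, and path here are placeholders for your cluster:

```scala
// Only the URI scheme and path change when reading from HDFS.
val hdfsRdd = sc.textFile("hdfs://localhost:9000/user/hadoop/input.txt")
```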

4. How should the program be modified to create a data table and import data from it? Using Spark SQL, register the data as a temporary view with the DataFrame's createOrReplaceTempView() method, then select data from the table with a Spark SQL SELECT statement.
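A hedged sketch of the temporary-view approach; the input file, view name, and column names are assumptions for illustration only:

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a SparkSession named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("TempViewExample").getOrCreate()

// Read structured data into a DataFrame (file and schema are placeholders).
val df = spark.read.option("header", "true").csv("file:///home/hadoop/data/people.csv")

// Register it as a temporary table and query it with Spark SQL.
df.createOrReplaceTempView("people")
val result = spark.sql("SELECT name, COUNT(*) AS cnt FROM people GROUP BY name")
result.show()
```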

5. Which transformation operations were used in the experiment, and what do they do? Several transformations were used, including map, filter, flatMap, union, distinct, groupByKey, and reduceByKey. These operations transform and process the data in a Spark RDD, for example selecting specific records from an RDD or aggregating data by key.
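A short sketch combining several of these transformations; the sample data is made up for illustration:

```scala
val words = sc.parallelize(Seq("spark hadoop", "spark hive"))

val pairs = words
  .flatMap(_.split(" "))        // flatMap: one line -> many words
  .filter(_.nonEmpty)           // filter: drop empty strings
  .map(w => (w, 1))             // map: word -> (word, 1)

val counts  = pairs.reduceByKey(_ + _)                 // reduceByKey: sum values per key
val grouped = pairs.groupByKey()                       // groupByKey: collect values per key
val unique  = words.flatMap(_.split(" ")).distinct()   // distinct: remove duplicates
```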

6. Which actions were used in the experiment, and what do they do? Several actions were used, including count, take, collect, and foreach. These operations trigger the Spark engine to actually evaluate the RDD and return results, for example the number of elements in the RDD or its first few elements.
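The corresponding actions, continuing the word-count example above; each call triggers an actual Spark job:

```scala
val counts = sc.parallelize(Seq(("spark", 2), ("hadoop", 1)))

counts.count()            // number of elements in the RDD
counts.take(1)            // first element(s), returned to the driver
counts.collect()          // all elements, returned to the driver
counts.foreach(println)   // run a side effect on each element (on the executors)
```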

4. 10.7 Exercises

1. Spark is a big data computing platform based on in-memory computing. Describe the main features of Spark.
(1) Fast computation: Spark's in-memory computing greatly improves the speed of data processing and analysis.
(2) Multi-language support: Spark supports multiple programming languages such as Java, Scala, Python, and R, so developers can program in a familiar language.
(3) Multiple data sources: Spark can read and process many types of data sources, including the Hadoop Distributed File System (HDFS), Cassandra, HBase, and Amazon S3.
(4) Ease of use: Spark provides easy-to-use high-level APIs, such as Spark SQL, the DataFrame API, and the MLlib machine learning library, making data analysis and machine learning tasks easier for developers.
(5) Scalability: Spark can scale out by adding more nodes to the cluster in order to process larger datasets.

2. Spark emerged to address the shortcomings of Hadoop MapReduce. List several shortcomings of Hadoop MapReduce and explain Spark's advantages.
Shortcomings of Hadoop MapReduce include:
(1) Low efficiency: Hadoop MapReduce is primarily disk-based, so it is limited by disk I/O speed when processing big data.
(2) No support for iterative computation: after each MapReduce task, results must be written to disk and then re-read for the next computation; this is poorly suited to iterative workloads such as machine learning algorithms.
(3) Not suited to real-time computing: Hadoop MapReduce is batch-oriented and does not fit real-time scenarios such as live monitoring of financial transactions.
Spark's advantages over Hadoop MapReduce:
(1) Higher performance: Spark uses in-memory computing with faster data access, giving higher throughput and faster response than Hadoop MapReduce.
(2) Iterative computation: Spark's RDD (Resilient Distributed Dataset) model supports iterative computation, making it suitable for workloads such as machine learning.
(3) Real-time computing: Spark provides the Spark Streaming engine, which supports real-time scenarios such as financial transaction monitoring and network security attack detection.
(4) Simpler APIs: Spark offers APIs in Scala, Java, Python, and other languages that are simpler and easier to use than Hadoop MapReduce's, so developers can build complex distributed applications faster.
(5) More application scenarios: besides batch processing, Spark can be used for graph computing, SQL processing, machine learning, and more, giving it a richer and wider range of uses.

3. BDAS, the data analytics software stack proposed by UC Berkeley, holds that current big data processing can be divided into three types.
(1) Batch processing: massive data is split into smaller parts and processed, suitable for offline processing of large volumes of data.
(2) Stream processing: data is processed in real time as it arrives, suitable for scenarios with strong real-time requirements.
(3) Interactive querying: results are returned dynamically in response to user queries, suitable for scenarios where users need timely answers to data queries.

4. Spark has created a big data ecosystem with an integrated structure and diverse functions. Describe Spark's ecosystem.
Spark Core: the core component of Spark, providing a unified in-memory computing engine that supports data processing and transformation.
Spark SQL: structured data processing and SQL querying, allowing developers to query and analyze data with SQL statements.
Spark Streaming: real-time data processing, including streaming input, processing, and output, with integration for message queues such as Kafka and Flume.
MLlib: Spark's machine learning library, providing common algorithms and tools such as classification, clustering, regression, and collaborative filtering.
GraphX: Spark's graph computing framework, supporting large-scale graph data processing and analysis.
SparkR: interfaces and tools for R developers to use Spark.
Tungsten: Spark's memory management and optimization module, which significantly improves Spark's performance.
PySpark: interfaces and tools for Python developers to use Spark.
Spark Job Server: a RESTful API for Spark jobs, making it easier to deploy and manage Spark applications in a distributed environment.
Zeppelin: a data visualization and interactive notebook tool supporting multiple languages, including Scala, Python, and R.
In summary, the Spark ecosystem covers big data, machine learning, stream processing, graph computing, and other fields, and provides a rich set of components and tools that let data analysts develop and run large-scale distributed data applications more efficiently.

5. What benefits can switching from a Hadoop + Storm architecture to a Spark architecture bring?
Faster processing: compared with Hadoop, Spark can run tasks in memory and therefore process data faster.
Easier to use: compared with Storm, Spark provides a simpler, easier-to-use API, so developers can write and run large-scale data processing jobs more quickly.
Higher reliability: Spark tasks can execute on different nodes simultaneously, so a single node failure does not stop the whole job.
Richer ecosystem: compared with Hadoop and Storm, Spark has a richer ecosystem and supports more data sources and tools, allowing data to be processed and analyzed more efficiently.
Better real-time processing: Spark is well suited to real-time data processing and can handle streaming data and produce results with low latency.
In summary, switching from a Hadoop + Storm architecture to Spark brings faster, more reliable, easier-to-use, and richer data processing, improving the efficiency and quality of big data processing.

6. Describe the concept of Spark on YARN. Spark on YARN refers to deploying Spark applications to run on a YARN-based Hadoop cluster. In this mode, Spark runs as a YARN application and uses the YARN resource manager for cluster resource management and task scheduling. Spark on YARN typically involves the following components:
Client: submits the Spark job to the YARN cluster.
Driver: the process that starts the Spark application, coordinates tasks, and manages the individual executors.
Executors: processes launched in containers on YARN nodes that perform the actual computation of Spark tasks and return results to the driver.
Application Master: created for each Spark application; it starts the application in the YARN cluster, requests resources, and manages the executors.
NodeManager: the YARN agent on each cluster node that launches and monitors the containers in which the driver and executors run.
ResourceManager: one of YARN's core components, responsible for managing cluster resources and coordinating resource allocation and scheduling among applications.
In Spark on YARN, Spark runs as a YARN application, taking full advantage of YARN's resource management and task scheduling while following the standard Hadoop development and operations process. This lets Spark applications use Hadoop cluster resources more efficiently, improving the efficiency and quality of big data processing.
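A minimal sketch of pointing an application at YARN; in practice the master is usually supplied via spark-submit rather than in code, HADOOP_CONF_DIR must point at the cluster configuration, and the application name and data here are assumptions:

```scala
import org.apache.spark.sql.SparkSession

// Requires HADOOP_CONF_DIR / YARN_CONF_DIR to be set so Spark can find the ResourceManager.
val spark = SparkSession.builder()
  .appName("SparkOnYarnExample")
  .master("yarn")                 // run as a YARN application
  .getOrCreate()

val data = spark.sparkContext.parallelize(1 to 1000)
println(data.sum())               // the computation runs in executors inside YARN containers
```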

7. Describe the following Spark concepts: RDD, DAG, stage, partition, narrow dependency, wide dependency.
RDD (Resilient Distributed Dataset): Spark's most basic abstraction, an immutable distributed collection that can be processed in parallel. An RDD is divided into many partitions, and each partition can be processed on the same or a different compute node.
DAG (Directed Acyclic Graph): in Spark, the RDDs produced by a job form a DAG that represents the logical flow of data processing; each RDD is a node and the edges are the dependencies between them.
Stage: the DAG is divided into stages at wide-dependency (shuffle) boundaries; each stage contains a pipeline of narrow-dependency transformations whose tasks can run in parallel.
Partition: the data in an RDD is distributed across different nodes according to specific rules, and each node processes its own part of the data.
Narrow dependency: each partition of the child RDD depends on only a bounded set of partitions of the parent RDD (for example map or filter); such dependencies let Spark execute tasks in a fully parallel, pipelined manner.
Wide dependency: a partition of the child RDD depends on many or all partitions of the parent RDD (for example groupByKey or reduceByKey), requiring a shuffle; Spark must wait for the parent stage to finish before starting, which can lead to data skew or network bottlenecks.
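A hedged illustration of the two dependency types with made-up data; toDebugString prints the lineage and shows where Spark splits it into stages at the shuffle:

```scala
val nums = sc.parallelize(1 to 100, numSlices = 4)

// Narrow dependency: each output partition depends on exactly one input partition.
val doubled = nums.map(_ * 2)

// Wide dependency: reduceByKey shuffles data, so child partitions depend on many parents.
val byParity = doubled.map(n => (n % 2, n)).reduceByKey(_ + _)

// The lineage printout shows a stage boundary (ShuffledRDD) introduced by the wide dependency.
println(byParity.toDebugString)
```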

8. Spark's operations on RDDs fall into two types: transformations and actions. What is the difference between them? Transformations are "lazy": they only record how the result should be computed (no actual computation happens), and the computation is performed only when an action is called. A transformation returns a new RDD and does not change the original RDD. An action is an operation that actually triggers computation; it causes the whole DAG to be evaluated and may return results to the driver or write them to an output medium such as the console. Actions are what cause Spark jobs to be executed. Simply put, transformations define the data processing flow, while actions start executing those operations and return the final results. In this way, Spark can process data efficiently while avoiding the materialization of unnecessary intermediate data.
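A small sketch of this lazy-evaluation contract; the file path is a placeholder, and nothing is actually computed until the action on the last line:

```scala
val lines = sc.textFile("file:///home/hadoop/data/input.txt")   // transformation: nothing is read yet

val longLines = lines.filter(_.length > 80)   // transformation: only the lineage is recorded
val lengths   = longLines.map(_.length)       // transformation: still no computation

val total = lengths.reduce(_ + _)             // action: the whole DAG is executed here
println(total)
```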

Origin blog.csdn.net/qq_50530107/article/details/131261086