[Python] PySpark Data Processing ① ( Introduction to PySpark | Introduction to Apache Spark | PySpark, the Python Version of Spark | Python Usage Scenarios )





1. Introduction to PySpark




1. Introduction to Apache Spark


Spark is a top-level project of the Apache Software Foundation: an open-source, distributed big data processing framework and unified analytics engine designed for large-scale data processing;


Compared with Hadoop's MapReduce,

  • Spark retains the advantages of MapReduce as a scalable, distributed, fault-tolerant processing framework, while being more efficient and concise to use;
  • Spark keeps intermediate data in memory during analysis, reducing the latency caused by frequent disk reads and writes;
  • Spark integrates tightly with the Hadoop ecosystem, including object storage (COS), HDFS, and Apache HBase;

With the Spark distributed computing framework, clusters of hundreds or even thousands of servers can be scheduled to compute over massive, PB/EB-scale datasets;


Spark supports multiple programming languages, including Java, Python, R, and Scala; the corresponding Python-language module is PySpark;

Python is the most widely used language in Spark;


2. PySpark: the Python Language Version of Spark


The Python language version of Spark is PySpark: a third-party library developed officially by the Spark project, providing the API that Spark offers to Python developers;

PySpark allows Python developers to write Spark applications in the Python language and use the distributed computing capabilities of the Spark analytics engine to analyze big data;

PySpark provides a rich set of data processing and analysis modules:

  • Spark Core: the core PySpark module, providing Spark's basic functionality and APIs;
  • Spark SQL: the SQL query module, supporting multiple data sources such as CSV, JSON, and Parquet;
  • Spark Streaming: the real-time stream processing module, which can process live data streams from sources such as Twitter and Flume;
  • Spark MLlib: machine learning algorithms and libraries, covering classification, regression, clustering, etc.;
  • Spark GraphFrames: the graph processing framework module;

Developers can use the above modules to build complex big data applications;


3. PySpark application scenarios


PySpark can be used as a Python library for data processing on your own computer; it can also submit tasks to a Spark cluster for distributed computing;



4. Python language usage scenarios


The Python language is used in a very wide range of scenarios, including:

  • Desktop GUI program development
  • Embedded development
  • Test development / operations development
  • Web backend development
  • Audio and video development
  • Image processing
  • Game development
  • Office automation
  • Scientific research
  • Big data analysis
  • Artificial intelligence

Most of these scenarios have dedicated languages and development platforms, so do not rashly choose Python for general-purpose development in them; in Web development, for example, Python's support is not particularly strong and its ecosystem is incomplete;

The Python language is mainly used in the fields of big data and artificial intelligence; in other fields, it is rarely the primary choice for development;


Origin: blog.csdn.net/han1202012/article/details/131998785