1. Introduction to Apache Spark
Apache Spark is a top-level project of the Apache Software Foundation: an open-source, distributed big data processing framework and a unified analytics engine built for large-scale data processing.
Compared with Hadoop's MapReduce:
- Spark retains MapReduce's scalability, distribution, and fault tolerance while being more efficient and concise to use;
- Spark keeps intermediate data in memory during analysis, reducing the latency caused by frequent disk reads and writes;
- Spark integrates tightly with the Hadoop ecosystem, including HDFS, Apache HBase, and object storage such as COS;
With the Spark distributed computing framework, clusters of hundreds or even thousands of servers can be scheduled to process massive datasets at the PB or even EB scale.
Spark supports multiple programming languages, including Java, Python, R, and Scala; the Python version is the PySpark module.
Python is the most widely used language with Spark.
2. PySpark, the Python language version of Spark
The Python version of Spark is PySpark, a third-party library officially developed by the Spark project; it is the API that Spark provides for Python developers.
PySpark lets Python developers write Spark applications in Python and use the distributed computing power of the Spark analytics engine to analyze big data.
PySpark provides a rich set of data processing and analysis modules:
- Spark Core: the core module, providing Spark's basic functionality and APIs;
- Spark SQL: the SQL query module, supporting multiple data sources such as CSV, JSON, and Parquet;
- Spark Streaming: the real-time stream processing module, which can handle live data streams from sources such as Twitter and Flume;
- Spark MLlib: machine learning algorithms and utilities, such as classification, regression, and clustering;
- Spark GraphFrames: the graph processing module;
Developers can combine these modules to build complex big data applications.
3. PySpark application scenarios
PySpark can be used as an ordinary Python library for data processing on a single machine, or it can submit jobs to a Spark cluster for distributed computing.
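Submitting to a cluster is typically done with the `spark-submit` tool that ships with Spark. The master URL and the script name below are placeholder values, not part of the original article.

```shell
# Submit a PySpark script to a standalone Spark cluster.
# spark://master-host:7077 and wordcount.py are placeholders.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  wordcount.py
```

Running the same script locally instead only requires `--master "local[*]"` (or omitting `--master` and setting the master in code).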
4. Python language usage scenarios
The Python language has a very wide range of usage scenarios, including:
- Desktop GUI development
- Embedded development
- Test development / operations development
- Web backend development
- Audio and video development
- Image processing
- Game development
- Office automation
- Scientific research
- Big data analysis
- Artificial intelligence
Most of these scenarios have dedicated languages and platforms, so do not rashly choose Python for general-purpose development; in Web development, for example, Python's support is weaker and its ecosystem is less complete.
The Python language is mainly used in big data and artificial intelligence; in other fields, Python is rarely the primary development language.