PySpark architecture and Jupyter Notebook integrated environment construction

1. Install Anaconda on Linux

  • Download Anaconda
    https://www.anaconda.com/distribution/
  • Command to install Anaconda; answer yes to every prompt except the VSCode prompt, where you should select no
    bash Anaconda3-5.1.0-Linux-x86_64.sh
# Spark integration
# Install Anaconda: answer yes (or press Enter) to every prompt; VSCode is not needed, so select no for it
bash /opt/software/Anaconda3-5.0.1-Linux-x86_64.sh
# Configure the Anaconda3 environment (Spark must already be installed before integrating it)
echo 'export SPARK_CONF_DIR=$SPARK_HOME/conf' >> /etc/profile
echo 'export ANACONDA_HOME=/root/anaconda3' >> /etc/profile
echo 'export PATH=$PATH:$ANACONDA_HOME/bin' >> /etc/profile
echo 'export PYSPARK_DRIVER_PYTHON=jupyter-notebook' >> /etc/profile
echo 'export PYSPARK_DRIVER_PYTHON_OPTS=" --ip=0.0.0.0 --port=8888 --allow-root"' >> /etc/profile
echo 'export PYSPARK_PYTHON=/root/anaconda3/bin/python' >> /etc/profile
echo 'export PYSPARK_PYTHON=/root/anaconda3/bin/python' >> /opt/install/spark/conf/spark-env.sh
source /etc/profile
cd ~
# Generate the Jupyter configuration file (overwrites an existing one if present)
jupyter notebook --generate-config
cd /root/.jupyter/
# Set the Jupyter login password
ipython
#In [1]: from notebook.auth import passwd
#In [2]: passwd()
#Enter password:
#Verify password:
#Out[4]: 'sha1:9a85ae2b62e2:10849310f951734b0e0b1f9615c92f249272b078'  <- remember this hash; the configuration file needs it
# Edit the jupyter_notebook_config.py configuration file
echo 'c.NotebookApp.allow_root=True' >> /root/.jupyter/jupyter_notebook_config.py
echo "c.NotebookApp.ip='*'" >> /root/.jupyter/jupyter_notebook_config.py
echo 'c.NotebookApp.open_browser=False' >> /root/.jupyter/jupyter_notebook_config.py
# Paste the password hash generated above here, after the u prefix
echo "c.NotebookApp.password=u'sha1:9a85ae2b62e2:10849310f951734b0e0b1f9615c92f249272b078'" >> /root/.jupyter/jupyter_notebook_config.py
echo 'c.NotebookApp.port=7070' >> /root/.jupyter/jupyter_notebook_config.py
# Start pyspark (start the Spark-related services first)
pyspark
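
Once pyspark opens the notebook, a quick sanity check confirms the integration works. A minimal sketch to run in the first notebook cell (the pyspark shell pre-creates the sc and spark objects):

# Run in the first Jupyter cell; `sc` and `spark` are created by the pyspark shell
print(sc.version)                        # Spark version in use
print(sc.parallelize(range(10)).sum())   # should print 45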

2. Introduction to PySpark

PySpark usage scenarios

  • Prototype development for big data processing or machine learning:
    algorithm verification where execution efficiency is not critical but rapid development is required
  • PySpark architecture

[Figure: PySpark architecture diagram]

PySpark package introduction

  • PySpark
    Core Classes:
    pyspark.SparkContext
    pyspark.RDD
    pyspark.sql.SQLContext
    pyspark.sql.DataFrame
  • pyspark.streaming
    pyspark.streaming.StreamingContext
    pyspark.streaming.DStream
  • pyspark.ml
  • pyspark.mllib
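
As a quick orientation, the modules listed above can be imported as follows (a minimal sketch; the specific classes pulled from pyspark.ml and pyspark.mllib are only examples):

from pyspark import SparkContext, RDD
from pyspark.sql import SQLContext, DataFrame
from pyspark.streaming import StreamingContext, DStream
from pyspark.ml import Pipeline                      # DataFrame-based ML API
from pyspark.mllib.regression import LabeledPoint    # RDD-based ML API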

Use PySpark to process data

  • Import the package
from pyspark import SparkContext
  • Get a SparkContext object
SparkContext.getOrCreate()

Create RDD

  • makeRDD() is not supported
  • parallelize(), textFile(), and wholeTextFiles() are supported (see the sketch below)
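
A minimal sketch of the three supported creation methods (the file paths are placeholders for illustration):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd1 = sc.parallelize([1, 2, 3, 4, 5])                 # from a local Python collection
rdd2 = sc.textFile("file:///root/example/data.txt")    # one record per line
rdd3 = sc.wholeTextFiles("file:///root/example/")      # (filename, content) pairs, one per file
print(rdd1.count())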

Using anonymous functions in PySpark

  • Scala language
val a=sc.parallelize(List("dog","tiger","lion","cat","panther","eagle"))
val b=a.map(x=>(x,1))
b.collect
  • Python language
a=sc.parallelize(("dog","tiger","lion","cat","panther","eagle"))
b=a.map(lambda x:(x,1))
b.collect()
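
Python lambdas are limited to a single expression; when the transformation needs more than one statement, a named function can be passed to map() instead (a minimal sketch, reusing sc from above):

def to_pair(word):
    count = 1                  # any multi-statement logic can go here
    return (word, count)

a = sc.parallelize(["dog", "tiger", "lion", "cat", "panther", "eagle"])
a.map(to_pair).collect()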

SparkContext.addPyFile

  • addFile(path, recursive=False)
    Distributes a local file (or directory) to the cluster
    The absolute path of the file is obtained with the SparkFiles.get() method (see the sketch after the example below)
  • addPyFile(path)
    Loads an existing Python file so it can be imported in tasks
  • Load an existing file and call its functions
# Contents of /root/sci.py
def sqrt(num):
    return num * num          # note: despite its name, this returns the square of num
def circle_area(r):
    return 3.14 * sqrt(r)     # area of a circle with radius r

# In the pyspark shell / driver:
sc.addPyFile("file:///root/sci.py")
from sci import circle_area
sc.parallelize([5, 9, 21]).map(lambda x: circle_area(x)).collect()
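
The addFile()/SparkFiles.get() pair mentioned above works the same way for non-Python resources. A minimal sketch, assuming a plain-text file at /root/lookup.txt (hypothetical path):

from pyspark import SparkFiles

sc.addFile("file:///root/lookup.txt")     # distribute the file (hypothetical path)
# SparkFiles.get() resolves the absolute path of the local copy of the file
print(SparkFiles.get("lookup.txt"))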

Use SparkSQL in PySpark

  • Import the package
from pyspark.sql import SparkSession
  • Create SparkSession object
spark = SparkSession.builder.getOrCreate()

  • Load a CSV file (a fuller sketch follows below)
spark.read.format("csv").option("header", "true").load("file:///xxx.csv")
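
Putting the three steps together, a minimal sketch that reads a CSV file and inspects it (the file path and the inferSchema option are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("file:///root/example/products.csv")   # hypothetical path
df.printSchema()
df.show(5)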

3. Case studies

1. Data exploration: summary statistics for the life expectancy dataset

from pyspark.sql import SparkSession
# create the spark session
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType
# load the data
df = spark.read.format("csv").option("delimiter", " ").load("file:///root/example/LifeExpentancy.txt") \
    .withColumn("Country", col("_c0")) \
    .withColumn("LifeExp", col("_c2").cast(DoubleType())) \
    .withColumn("Region", col("_c4")) \
    .select(col("Country"), col("LifeExp"), col("Region"))
df.describe("LifeExp").show()
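
As a possible next step in the exploration, the same DataFrame can be summarized per region (a minimal sketch using only the columns defined above):

from pyspark.sql.functions import avg, count

df.groupBy("Region") \
  .agg(count("Country").alias("countries"), avg("LifeExp").alias("avg_life_exp")) \
  .show()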

2. Mixed use of Spark and Python third-party libraries

Use Spark for big data ETL, then analyze or visualize the processed data with Python third-party libraries.

  • Pandas for data analysis
  • Pandas DataFrame to Spark DataFrame
spark.createDataFrame(pandas_df)
  • Spark DataFrame to Pandas DataFrame
spark_df.toPandas()
  • Matplotlib for data visualization
  • Scikit-learn for machine learning (a sketch of the hand-off to these libraries follows the conversion example below)
  • Conversion between a Pandas DataFrame and a Spark DataFrame
# Pandas DataFrame to Spark DataFrame
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
pandas_df = pd.read_csv("./products.csv", header=None, usecols=[1, 3, 5])
print(pandas_df)
# convert to Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)     
spark_df.show()
df = spark_df.withColumnRenamed("1", "id").withColumnRenamed("3", "name").withColumnRenamed("5", "remark")
# convert back to Pandas DataFrame
df.toPandas() 
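
The Matplotlib and Scikit-learn bullets above follow the same pattern: bring the (already reduced) Spark result back to Pandas with toPandas(), then hand it to the library. A minimal sketch with hypothetical numeric data, showing the hand-off to scikit-learn:

import pandas as pd
from pyspark.sql import SparkSession
from sklearn.linear_model import LinearRegression

spark = SparkSession.builder.getOrCreate()
# hypothetical numeric data, only to illustrate the Spark -> Pandas -> scikit-learn hand-off
sdf = spark.createDataFrame(pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                                          "y": [2.1, 3.9, 6.2, 8.1]}))
pdf = sdf.toPandas()                                   # Spark DataFrame -> Pandas DataFrame
model = LinearRegression().fit(pdf[["x"]], pdf["y"])   # ordinary scikit-learn from here on
print(model.coef_, model.intercept_)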

3. Use PySpark to explore data through graphs

  • Divide the data into intervals and count the number of values in each interval
# from the previous LifeExpentancy example
rdd = df.select("LifeExp").rdd.map(lambda x: x[0])
# Split the data into 10 buckets and get the number of values in each bucket;
# rdd.histogram(10) returns (bucket boundaries, counts per bucket)
(bins, counts) = rdd.histogram(10)
print(bins)
print(counts)
import matplotlib.pyplot as plt
import numpy as np

plt.hist(rdd.collect(), 10)  # the default number of bins is 10
plt.title("Life Expectancy Histogram")
plt.xlabel("Life Expectancy")
plt.ylabel("Countries")
plt.show()
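
For a large dataset, collecting every value to the driver just to plot it can be avoided: the counts already computed by rdd.histogram() can be drawn directly as a bar chart (a minimal sketch reusing bins and counts from above):

import matplotlib.pyplot as plt

widths = [bins[i + 1] - bins[i] for i in range(len(counts))]
plt.bar(bins[:-1], counts, width=widths, align="edge")   # pre-aggregated histogram
plt.title("Life Expectancy Histogram")
plt.xlabel("Life Expectancy")
plt.ylabel("Countries")
plt.show()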



Origin blog.csdn.net/sun_0128/article/details/108310756