PySpark architecture and Jupyter Notebook integrated environment construction

1. Install Anaconda on Linux

  • Download Anaconda
    https://www.anaconda.com/distribution/
  • Command to install Anaconda; answer yes to every prompt except the VSCode prompt, where you should select no
    bash Anaconda3-5.1.0-Linux-x86_64.sh
# Spark integration
# Install Anaconda: answer yes (or press Enter) to every prompt; VSCode is not needed, so select no for it
bash /opt/software/Anaconda3-5.0.1-Linux-x86_64.sh
# Configure the Anaconda3 environment (Spark must already be installed before integrating it)
echo 'export SPARK_CONF_DIR=$SPARK_HOME/conf' >> /etc/profile
echo 'export ANACONDA_HOME=/root/anaconda3' >> /etc/profile
echo 'export PATH=$PATH:$ANACONDA_HOME/bin' >> /etc/profile
echo 'export PYSPARK_DRIVER_PYTHON=jupyter-notebook' >> /etc/profile
echo 'export PYSPARK_DRIVER_PYTHON_OPTS=" --ip=0.0.0.0 --port=8888 --allow-root"' >> /etc/profile
echo 'export PYSPARK_PYTHON=/root/anaconda3/bin/python' >> /etc/profile
echo 'export PYSPARK_PYTHON=/root/anaconda3/bin/python' >> /opt/install/spark/conf/spark-env.sh
source /etc/profile
cd ~
# Generate the Jupyter configuration file (overwrites an existing one if present)
jupyter notebook --generate-config
cd /root/.jupyter/
# Set the Jupyter login password
ipython
#In [1]: from notebook.auth import passwd
#In [2]: passwd()
#Enter password:
#Verify password:
#Out[4]: 'sha1:9a85ae2b62e2:10849310f951734b0e0b1f9615c92f249272b078'  <- remember this hash; the configuration file needs it
# Edit the jupyter_notebook_config.py configuration file
echo 'c.NotebookApp.allow_root=True' >> /root/.jupyter/jupyter_notebook_config.py
echo "c.NotebookApp.ip='*'" >> /root/.jupyter/jupyter_notebook_config.py
echo 'c.NotebookApp.open_browser=False' >> /root/.jupyter/jupyter_notebook_config.py
# Paste the password hash generated above here, after the u prefix
echo "c.NotebookApp.password=u'sha1:9a85ae2b62e2:10849310f951734b0e0b1f9615c92f249272b078'" >> /root/.jupyter/jupyter_notebook_config.py
echo 'c.NotebookApp.port=7070' >> /root/.jupyter/jupyter_notebook_config.py
# Start pyspark (start the Spark-related services first)
pyspark
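
Once pyspark opens the notebook, a quick sanity check confirms the integration works. A minimal sketch to run in the first notebook cell (the pyspark shell pre-creates the sc and spark objects):

# Run in the first Jupyter cell; `sc` and `spark` are created by the pyspark shell
print(sc.version)                        # Spark version in use
print(sc.parallelize(range(10)).sum())   # should print 45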

2. Introduction to PySpark

PySpark usage scenarios

  • Prototype development for big data processing or machine learning:
    algorithm verification where execution efficiency is not critical but rapid development is required
  • PySpark architecture

[Figure: PySpark architecture diagram]

PySpark package introduction

  • PySpark
    Core Classes:
    pyspark.SparkContext
    pyspark.RDD
    pyspark.sql.SQLContext
    pyspark.sql.DataFrame
  • pyspark.streaming
    pyspark.streaming.StreamingContext
    pyspark.streaming.DStream
  • pyspark.ml
  • pyspark.mllib
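
As a quick orientation, the modules listed above can be imported as follows (a minimal sketch; the specific classes pulled from pyspark.ml and pyspark.mllib are only examples):

from pyspark import SparkContext, RDD
from pyspark.sql import SQLContext, DataFrame
from pyspark.streaming import StreamingContext, DStream
from pyspark.ml import Pipeline                      # DataFrame-based ML API
from pyspark.mllib.regression import LabeledPoint    # RDD-based ML API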

Use PySpark to process data

  • Import the package
from pyspark import SparkContext
  • Get a SparkContext object
SparkContext.getOrCreate()

Create RDD

  • makeRDD() is not supported
  • parallelize(), textFile(), and wholeTextFiles() are supported (see the sketch below)
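
A minimal sketch of the three supported creation methods (the file paths are placeholders for illustration):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd1 = sc.parallelize([1, 2, 3, 4, 5])                 # from a local Python collection
rdd2 = sc.textFile("file:///root/example/data.txt")    # one record per line
rdd3 = sc.wholeTextFiles("file:///root/example/")      # (filename, content) pairs, one per file
print(rdd1.count())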

Using anonymous functions in PySpark

  • Scala language
val a=sc.parallelize(List("dog","tiger","lion","cat","panther","eagle"))
val b=a.map(x=>(x,1))
b.collect
  • Python language
a=sc.parallelize(("dog","tiger","lion","cat","panther","eagle"))
b=a.map(lambda x:(x,1))
b.collect()
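
Python lambdas are limited to a single expression; when the transformation needs more than one statement, a named function can be passed to map() instead (a minimal sketch, reusing sc from above):

def to_pair(word):
    count = 1                  # any multi-statement logic can go here
    return (word, count)

a = sc.parallelize(["dog", "tiger", "lion", "cat", "panther", "eagle"])
a.map(to_pair).collect()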

SparkContext.addPyFile

  • addFile(path, recursive=False)
    Distributes a local file (or directory) to the cluster
    The absolute path of the file is obtained with the SparkFiles.get() method (see the sketch after the example below)
  • addPyFile(path)
    Loads an existing Python file so it can be imported in tasks
  • Load an existing file and call its functions
# Contents of /root/sci.py
def sqrt(num):
    return num * num          # note: despite its name, this returns the square of num
def circle_area(r):
    return 3.14 * sqrt(r)     # area of a circle with radius r

# In the pyspark shell / driver:
sc.addPyFile("file:///root/sci.py")
from sci import circle_area
sc.parallelize([5, 9, 21]).map(lambda x: circle_area(x)).collect()
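
The addFile()/SparkFiles.get() pair mentioned above works the same way for non-Python resources. A minimal sketch, assuming a plain-text file at /root/lookup.txt (hypothetical path):

from pyspark import SparkFiles

sc.addFile("file:///root/lookup.txt")     # distribute the file (hypothetical path)
# SparkFiles.get() resolves the absolute path of the local copy of the file
print(SparkFiles.get("lookup.txt"))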

Use SparkSQL in PySpark

  • Import the package
from pyspark.sql import SparkSession
  • Create SparkSession object
spark = SparkSession.builder.getOrCreate()

  • Load a CSV file (a fuller sketch follows below)
spark.read.format("csv").option("header", "true").load("file:///xxx.csv")
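
Putting the three steps together, a minimal sketch that reads a CSV file and inspects it (the file path and the inferSchema option are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("file:///root/example/products.csv")   # hypothetical path
df.printSchema()
df.show(5)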

3. Case studies

1. Data exploration: summary statistics for the life expectancy dataset

from pyspark.sql import SparkSession
# create the spark session
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType
# load the data
df = spark.read.format("csv").option("delimiter", " ").load("file:///root/example/LifeExpentancy.txt") \
    .withColumn("Country", col("_c0")) \
    .withColumn("LifeExp", col("_c2").cast(DoubleType())) \
    .withColumn("Region", col("_c4")) \
    .select(col("Country"), col("LifeExp"), col("Region"))
df.describe("LifeExp").show()
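
As a possible next step in the exploration, the same DataFrame can be summarized per region (a minimal sketch using only the columns defined above):

from pyspark.sql.functions import avg, count

df.groupBy("Region") \
  .agg(count("Country").alias("countries"), avg("LifeExp").alias("avg_life_exp")) \
  .show()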

2. Mixed use of Spark and Python third-party libraries

Use Spark for big data ETL, then analyze or visualize the processed data with Python third-party libraries.

  • Pandas for data analysis
  • Pandas DataFrame to Spark DataFrame
spark.createDataFrame(pandas_df)
  • Spark DataFrame to Pandas DataFrame
spark_df.toPandas()
  • Matplotlib for data visualization
  • Scikit-learn for machine learning (a sketch of the hand-off to these libraries follows the conversion example below)
  • Conversion between a Pandas DataFrame and a Spark DataFrame
# Pandas DataFrame to Spark DataFrame
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
pandas_df = pd.read_csv("./products.csv", header=None, usecols=[1, 3, 5])
print(pandas_df)
# convert to Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)     
spark_df.show()
df = spark_df.withColumnRenamed("1", "id").withColumnRenamed("3", "name").withColumnRenamed("5", "remark")
# convert back to Pandas DataFrame
df.toPandas() 
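
The Matplotlib and Scikit-learn bullets above follow the same pattern: bring the (already reduced) Spark result back to Pandas with toPandas(), then hand it to the library. A minimal sketch with hypothetical numeric data, showing the hand-off to scikit-learn:

import pandas as pd
from pyspark.sql import SparkSession
from sklearn.linear_model import LinearRegression

spark = SparkSession.builder.getOrCreate()
# hypothetical numeric data, only to illustrate the Spark -> Pandas -> scikit-learn hand-off
sdf = spark.createDataFrame(pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                                          "y": [2.1, 3.9, 6.2, 8.1]}))
pdf = sdf.toPandas()                                   # Spark DataFrame -> Pandas DataFrame
model = LinearRegression().fit(pdf[["x"]], pdf["y"])   # ordinary scikit-learn from here on
print(model.coef_, model.intercept_)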

3. Use PySpark to explore data through graphs

  • Divide the data into intervals and count the number of values in each interval
# from the previous LifeExpentancy example
rdd = df.select("LifeExp").rdd.map(lambda x: x[0])
# Split the data into 10 buckets and get the number of values in each bucket;
# rdd.histogram(10) returns (bucket boundaries, counts per bucket)
(bins, counts) = rdd.histogram(10)
print(bins)
print(counts)
import matplotlib.pyplot as plt
import numpy as np

plt.hist(rdd.collect(), 10)  # the default number of bins is 10
plt.title("Life Expectancy Histogram")
plt.xlabel("Life Expectancy")
plt.ylabel("Countries")
plt.show()
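
For a large dataset, collecting every value to the driver just to plot it can be avoided: the counts already computed by rdd.histogram() can be drawn directly as a bar chart (a minimal sketch reusing bins and counts from above):

import matplotlib.pyplot as plt

widths = [bins[i + 1] - bins[i] for i in range(len(counts))]
plt.bar(bins[:-1], counts, width=widths, align="edge")   # pre-aggregated histogram
plt.title("Life Expectancy Histogram")
plt.xlabel("Life Expectancy")
plt.ylabel("Countries")
plt.show()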



Origin blog.csdn.net/sun_0128/article/details/108310756