Spark standalone environment configuration

Overview

Big data and artificial intelligence have been hyped for years, and Hadoop and Spark have been around for a long time. I always wanted to try them, but never had a real scenario at work to use them in, so it kept getting put off. Recently I took Cai Yuannan's GeekTime course "Large-Scale Data Processing in Practice", which covers Spark extensively, so I took the opportunity to set up a Spark standalone environment in a virtual machine.

On the one hand, it is a chance to get familiar with how Spark is used; on the other hand, even without access to a real big-data analysis scenario, studying how Spark handles large data sets and how its API is designed can still provide ideas for everyday programming.

Spark standalone environment configuration

I set it up on Debian 10.

JDK environment configuration

I use Oracle's standard JDK 1.8. Downloading the JDK from Oracle's official website is very slow, so I recommend Huawei's mirror: https://mirrors.huaweicloud.com/java/jdk/8u202-b08/jdk-8u202-linux-x64.tar.gz

Download and extract it to /usr/local; the archive unpacks into /usr/local/jdk1.8.0_202:

$ wget https://mirrors.huaweicloud.com/java/jdk/8u202-b08/jdk-8u202-linux-x64.tar.gz 
$ sudo tar zxvf jdk-8u202-linux-x64.tar.gz -C /usr/local 

Then configure the environment variables: for bash, edit ~/.bashrc; for zsh, edit ~/.zshenv.

# java
export JAVA_HOME=/usr/local/jdk1.8.0_202
export PATH=$PATH:$JAVA_HOME/bin

Once configured, check whether the installation succeeded with the following command:

$ java -version
java version "1.8.0_202"
Java(TM) SE Runtime Environment (build 1.8.0_202-b08)
Java HotSpot(TM) 64-Bit Server VM (build 25.202-b08, mixed mode)

Spark environment configuration

Spark is also very simple to install: download the latest package from the official website. The version I downloaded is as follows:

$ wget http://mirror.bit.edu.cn/apache/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz
$ sudo tar zxvf spark-3.0.0-preview2-bin-hadoop2.7.tgz -C /usr/local

As with the JDK, the archive is extracted to /usr/local. It unpacks into /usr/local/spark-3.0.0-preview2-bin-hadoop2.7, which I link to /usr/local/spark so that the path matches the SPARK_HOME configured below:

$ sudo ln -s /usr/local/spark-3.0.0-preview2-bin-hadoop2.7 /usr/local/spark

Spark also needs environment variables. As with the JDK, add them to ~/.bashrc or ~/.zshenv depending on whether you use bash or zsh:

# spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin

After configuring, enter the following command at the command line to check that Spark starts successfully:

$ pyspark
Python 2.7.16 (default, Oct 10 2019, 22:02:15)
[GCC 8.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
20/03/02 15:21:23 WARN Utils: Your hostname, debian-wyb resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
20/03/02 15:21:23 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/03/02 15:21:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/usr/local/spark/python/pyspark/context.py:219: DeprecationWarning: Support for Python 2 and Python 3 prior to version 3.6 is deprecated as of Spark 3.0. See also the plan for dropping Python 2 support at https://spark.apache.org/news/plan-for-dropping-python-2-support.html.
  DeprecationWarning)
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
      /_/

Using Python version 2.7.16 (default, Oct 10 2019 22:02:15)
SparkSession available as 'spark'.

Note that pyspark picked up Python 2.x here. In the Python environment configured below, we will develop with Python 3 instead.
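Before moving on, a quick sanity check in the pyspark shell (my own addition, not part of the original output) confirms that the spark session works; sc is the SparkContext the shell also provides:

>>> spark.range(5).count()     # build a tiny 5-row DataFrame and count it
5
>>> sc.setLogLevel("ERROR")    # optional: silence the WARN messages shown above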

Python environment configuration

Debian 10 ships with both Python 2 and Python 3. To avoid changing the system defaults, we install virtualenv and run Spark inside a virtual environment.

First, install virtualenv and create a separate Python 3 environment:

$ pip3 install virtualenv
$ virtualenv py3-vm

Activate py3-vm and install pyspark and findspark in it, for developing the Spark examples:

$ source ./py3-vm/bin/activate
$ pip install pyspark
$ pip install findspark
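
findspark.init() reads SPARK_HOME and adds the Spark distribution's Python bindings to sys.path; with pyspark installed from pip it is not strictly required, but the sample code below uses it to pick up the Spark build installed above. A minimal check inside py3-vm (my own sketch; the app name is arbitrary):

import findspark
findspark.init()   # locate SPARK_HOME and put its Python bindings on sys.path

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('venv-check').getOrCreate()
print(spark.version)   # should report the Spark build picked up above
spark.stop()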

To exit the py3-vm environment, use the following command:

$ deactivate

Spark usage example

With the environment configured, let's try out the Spark API with a simple order-statistics example:

  1. Data source: a CSV file of orders with 3 fields per line: order number (unique), shop name, and order amount
  2. Order count: count the number of orders per shop
  3. Order amount: sum the order amount per shop

Sample Code (order_stat.py)

import findspark

# locate SPARK_HOME and make pyspark importable
findspark.init()

if __name__ == "__main__":
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import *

    spark = SparkSession\
        .builder\
        .appName('order stat')\
        .getOrCreate()

    lines = spark.read.csv("./orders.csv",
                           sep=",",
                           schema="order INT, shop STRING, price DOUBLE")

    # count the number of orders per shop
    orderCounts = lines.groupBy('shop').count()
    orderCounts.show()

    # total order amount per shop
    shopPrices = lines.groupBy('shop').sum('price')
    shopPrices.show()

    spark.stop()
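
The two statistics can also be computed in a single pass with groupBy().agg(); this variant is my own sketch, reusing the lines DataFrame from order_stat.py:

# both statistics in one aggregation (my own sketch)
from pyspark.sql import functions as F

orderStats = lines.groupBy('shop').agg(
    F.count('order').alias('order_count'),   # number of orders per shop
    F.sum('price').alias('total_price'))     # total order amount per shop
orderStats.show()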

Contents of the test CSV file (orders.csv)

1,京东,10.0
2,京东,20.0
3,天猫,21.0
4,京东,22.0
5,天猫,11.0
6,京东,22.0
7,天猫,23.0
8,天猫,24.0
9,天猫,40.0
10,天猫,70.0
11,天猫,10.0
12,天猫,20.0
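
To confirm that the schema string passed to spark.read.csv is applied to this file as intended, printSchema() can be called on the DataFrame (my own addition; the commented lines show the expected output):

lines.printSchema()
# root
#  |-- order: integer (nullable = true)
#  |-- shop: string (nullable = true)
#  |-- price: double (nullable = true)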

Run result

$ python order_stat.py
20/03/02 17:40:50 WARN Utils: Your hostname, debian-wyb resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
20/03/02 17:40:50 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/03/02 17:40:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
+----+-----+
|shop|count|
+----+-----+
|京东|    4|
|天猫|    8|
+----+-----+

+----+----------+
|shop|sum(price)|
+----+----------+
|京东|      74.0|
|天猫|     219.0|
+----+----------+
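
The same statistics can also be expressed in Spark SQL by registering the DataFrame as a temporary view; a sketch reusing the spark session and lines DataFrame from order_stat.py (the view name "orders" is my own choice):

# Spark SQL equivalent (my own sketch)
lines.createOrReplaceTempView("orders")
spark.sql("""
    SELECT shop, COUNT(*) AS order_count, SUM(price) AS total_price
    FROM orders
    GROUP BY shop
""").show()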


Original article: www.cnblogs.com/wang_yb/p/12396966.html