Spark Introduction and Installation Guide (CentOS 7)

Copyright notice: this post is the author's original study notes. Please credit the source when reposting: https://blog.csdn.net/wugenqiang/article/details/81232320

Spark:

(1) is a fast, scalable engine for processing massive amounts of data;

(2) is written in Scala;

(3) provides the spark shell, which developers can use to learn Spark or to explore data interactively;

(4) supports writing Spark applications for large-scale data processing in Python, Java, R, and Scala;

(5) is a low-latency, distributed cluster computing system aimed at very large data sets, roughly 40 times faster than MapReduce;

(6) can be viewed as the successor to Hadoop: the first generation, Hadoop, used HDFS; the second generation added a cache to hold intermediate results and could push Map/Reduce tasks proactively; the third generation is the streaming model championed by Spark;

(7) is compatible with the Hadoop API and can read and write HDFS, HBase, SequenceFiles, and so on.

Note: before installing Spark on Linux, Hadoop must already be deployed and Scala must be installed.

The versions used in this walkthrough:

Name     Version
JDK      1.8.0_151
Hadoop   2.6.3.0-235
Scala    2.11.0
Spark    2.3.1
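You can quickly confirm the prerequisites from the shell. This is only a sanity check; the exact version strings will of course differ from machine to machine:

# verify the JDK, Hadoop and Scala installations before installing Spark
java -version
hadoop version
scala -version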

I. Download

Official site: http://spark.apache.org/downloads.html

Or from a mirror: https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.3.1/

Download the latest release (2.3.1 at the time of writing): spark-2.3.1-bin-hadoop2.6.tgz
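If you prefer to download from the command line, the file can be fetched with wget from the mirror above (assuming the 2.3.1 release is still listed there; older releases are eventually moved to the Apache archive):

# download the prebuilt Spark 2.3.1 package for Hadoop 2.6 from the Tsinghua mirror
wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.6.tgz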

II. Extract

Command: tar -zxvf spark-2.3.1-bin-hadoop2.6.tgz
[root@wugenqiang ~]# ls
anaconda-ks.cfg  metastore_db     spark-2.3.1-bin-hadoop2.6.tgz
derby.log        original-ks.cfg  wugenqiang.hello
[root@wugenqiang ~]# tar xvfz spark-2.3.1-bin-hadoop2.6.tgz 
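Alternatively, tar can extract the archive straight into the target directory used in the next step, which makes the separate mv unnecessary (assuming /usr/local is where you want Spark to live):

# extract directly into /usr/local instead of extracting and then moving
tar -zxvf spark-2.3.1-bin-hadoop2.6.tgz -C /usr/local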

III. Move to the installation path

[root@wugenqiang ~]# cd /usr/local
[root@wugenqiang local]# ls
bin  etc  games  include  lib  lib64  libexec  sbin  share  src
[root@wugenqiang local]# mv /root/spark-2.3.1-bin-hadoop2.6 /usr/local
[root@wugenqiang local]# ls
bin  games    lib    libexec  share                      src
etc  include  lib64  sbin     spark-2.3.1-bin-hadoop2.6

IV. Configure environment variables

1. Command: vim /etc/profile

[root@wugenqiang ~]# vim /etc/profile

Add:

export SPARK_HOME=/usr/local/spark-2.3.1-bin-hadoop2.6
export PATH=${PATH}:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH

2. Apply the changes

[root@wugenqiang ~]# source /etc/profile
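Once the profile has been sourced, you can check that the variables took effect. spark-submit --version prints the Spark banner and version; if the command is not found, the PATH entry above did not take effect:

# confirm the environment variables point at the new installation
echo $SPARK_HOME
spark-submit --version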

V. Using RPM packages

Alternatively, Spark can be installed as RPM packages when a suitable yum repository is configured (on this machine the packages were already present):

1. Install the packages:

[root@wugenqiang ~]# yum install -y spark2 spark2-python
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
Package spark2-2.2.0.2.6.3.0-235.noarch already installed and latest version
Package spark2-python-2.2.0.2.6.3.0-235.noarch already installed and latest version
Nothing to do

VI. Go to the Spark installation directory, then enter its bin folder

[root@wugenqiang ~]# cd /usr/local/spark-2.3.1-bin-hadoop2.6/
[root@wugenqiang spark-2.3.1-bin-hadoop2.6]# cd bin
[root@wugenqiang bin]# ls
beeline               pyspark.cmd       spark-shell
beeline.cmd           run-example       spark-shell2.cmd
docker-image-tool.sh  run-example.cmd   spark-shell.cmd
find-spark-home       spark-class       spark-sql
find-spark-home.cmd   spark-class2.cmd  spark-sql2.cmd
load-spark-env.cmd    spark-class.cmd   spark-sql.cmd
load-spark-env.sh     sparkR            spark-submit
pyspark               sparkR2.cmd       spark-submit2.cmd
pyspark2.cmd          sparkR.cmd        spark-submit.cmd

1. Run spark-shell to get a Scala shell

[root@wugenqiang bin]# spark-shell

Result:

[root@wugenqiang bin]# spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/07/27 11:11:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.75.213:4040
Spark context available as 'sc' (master = local[*], app id = local-1532661080069).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.2.6.3.0-235
      /_/
         
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 
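Besides the interactive shell, a quick end-to-end check is to run one of the examples bundled with the distribution via the run-example script shown in the bin listing above. SparkPi should print an approximate value of Pi among its output:

# run the bundled SparkPi example with 10 partitions as a smoke test
cd /usr/local/spark-2.3.1-bin-hadoop2.6
./bin/run-example SparkPi 10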

2. Run pyspark (Python)

[root@wugenqiang ~]# source /etc/profile
[root@wugenqiang ~]# pyspark
Python 2.7.5 (default, Aug  4 2017, 00:39:18) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
18/07/27 11:47:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 2.7.5 (default, Aug  4 2017 00:39:18)
SparkSession available as 'spark'.
>>> 
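Standalone Python scripts are submitted with spark-submit rather than typed into the REPL. A quick test is the pi.py script that ships under the examples directory of the standard distribution (path assumed as in a stock Spark 2.3.1 tarball):

# submit the bundled Python Pi-estimation example to a local Spark
cd /usr/local/spark-2.3.1-bin-hadoop2.6
./bin/spark-submit examples/src/main/python/pi.py 10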

VII. Adjust the log level to control how much output is printed

In the conf directory, copy log4j.properties.template to log4j.properties, then find the line log4j.rootCategory=INFO, console

[root@wugenqiang spark-2.3.1-bin-hadoop2.6]# ls
bin   data      jars        LICENSE   NOTICE  R          RELEASE  yarn
conf  examples  kubernetes  licenses  python  README.md  sbin
[root@wugenqiang spark-2.3.1-bin-hadoop2.6]# cd conf
[root@wugenqiang conf]# ls
docker.properties.template   slaves.template
fairscheduler.xml.template   spark-defaults.conf.template
log4j.properties.template    spark-env.sh.template
metrics.properties.template
[root@wugenqiang conf]# cp log4j.properties.template log4j.properties
[root@wugenqiang conf]# ls
docker.properties.template  metrics.properties.template
fairscheduler.xml.template  slaves.template
log4j.properties            spark-defaults.conf.template
log4j.properties.template   spark-env.sh.template

Change INFO to WARN (any other level can also be used)

[root@wugenqiang conf]# vim log4j.properties
log4j.rootCategory=INFO, console

Change it to:

log4j.rootCategory=WARN, console

After this, the shell will print much less log output the next time it starts.
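If you prefer a one-liner over editing the file in vim, the same change can be made with sed (paths as in the layout shown above):

# switch the console log level from INFO to WARN in one step
cd /usr/local/spark-2.3.1-bin-hadoop2.6/conf
sed -i 's/^log4j.rootCategory=INFO, console/log4j.rootCategory=WARN, console/' log4j.properties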

VIII. Open the Web UI

While spark-shell or pyspark is running, the Spark context Web UI is reachable on port 4040 of the driver host, as reported in the startup log above:

http://192.168.75.213:4040
