Spark Series Lab 1: Spark Shell Basics

Scala, Python, and R shell operations in Spark
Lab environment
Linux Ubuntu 16.04
Prerequisites:

  1. Java runtime environment deployed
  2. R language runtime environment deployed
  3. Spark Local mode deployed
    The prerequisites above have already been set up for you.

Lab content
With the prerequisites above in place, complete the Scala, Python, and R shell operations in Spark.

Lab steps
1. Click "Terminal" to open a new window

2. Start the Scala shell

Scala is Spark's default language. Enter the following command in the terminal to start the Scala shell:

spark-shell

After startup the terminal displays the following:

dolphin@tools:~$ spark-shell 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/08/06 08:17:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://172.21.0.2:4040
Spark context available as 'sc' (master = local[*], app id = local-1533543446142).
Spark session available as 'spark'.
Welcome to
____              __
/ __/__  ___ _____/ /__
_\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 2.1.3
/_/
 
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161)
Type in expressions to have them evaluated.
Type :help for more information.
 
scala>

The scala> prompt shown above indicates that you have entered the Scala shell.

3. Use the Scala shell to complete a word count example

Enter the following Scala statement in the Scala shell:

sc.textFile("file:///home/dolphin/words.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("file:///home/dolphin/output")

After it runs, the display is as follows:

scala>sc.textFile("file:///home/dolphin/words.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("file:///home/dolphin/output")
 
scala>

At this point the result files have been produced in the /home/dolphin/output directory. Enter the following command at the Scala prompt to exit the Scala shell:

:quit

This returns you to the terminal.
Enter the following command to view the results:

cat ~/output/part-*

The output below indicates that the word count example is complete.

dolphin@tools:~$ cat ~/output/part-*
(are,2)
(am,1)
(how,1)
(dolphin,2)
(hello,2)
(what,1)
(now,1)
(world,1)
(you,2)
(i,1)
(words.txt,1)
(doing,1)

Notes on the code:
sc is the SparkContext object, the entry point for submitting a Spark program
textFile("file:///home/dolphin/words.txt") reads data from the local filesystem
flatMap(_.split(" ")) splits each line into words and flattens the result
map((_,1)) turns each word into a (word, 1) tuple
reduceByKey(_+_) reduces by key, summing the values for each word
saveAsTextFile("file:///home/dolphin/output") writes the result to the local filesystem
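
If it helps to see each stage separately, the same word count can be typed into the Scala shell step by step, binding each intermediate RDD to a name. This is only a sketch; the output directory output2 is a hypothetical path chosen so it does not collide with the result above (saveAsTextFile fails if the target directory already exists).

val lines  = sc.textFile("file:///home/dolphin/words.txt")   // read the file as an RDD of lines
val words  = lines.flatMap(_.split(" "))                     // split each line on spaces and flatten into words
val pairs  = words.map((_, 1))                               // pair each word with the count 1
val counts = pairs.reduceByKey(_ + _)                        // sum the counts for each distinct word
counts.saveAsTextFile("file:///home/dolphin/output2")        // write the (word, count) pairs as text files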

4. Start the Python shell

Enter the following command in the terminal to start the Python shell:

pyspark

After startup the display is as follows:

dolphin@tools:~$ pyspark 
Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/08/07 02:40:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/07 02:40:41 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
____              __
/ __/__  ___ _____/ /__
_\ \/ _ \/ _ `/ __/  '_/
/__ / .__/\_,_/_/ /_/\_\   version 2.1.3
/_/
 
Using Python version 2.7.12 (default, Dec  4 2017 14:50:18)
SparkSession available as 'spark'.
>>>

The >>> prompt shown above indicates that you have entered the Python shell.

5. Use Python in Spark to implement a word filtering example

Run the following code at the Python prompt:

lines = sc.textFile("file:///apps/spark/README.md")
pythonLines = lines.filter(lambda line: "Python" in line)
pythonLines.count()
pythonLines.collect()

After it runs, the display is as follows:

>>>lines = sc.textFile("file:///apps/spark/README.md")
>>>pythonLines = lines.filter(lambda line: "Python" in line)
>>>pythonLines.count()
3
>>>pythonLines.collect()
[u'high-level APIs in Scala, Java, Python, and R, and an optimized engine that', u'## Interactive Python Shell', u'Alternatively, if you prefer Python, you can use the Python shell:']

Notes on the code:
sc is the SparkContext object, the entry point for submitting a Spark program
textFile("file:///apps/spark/README.md") reads data from the local filesystem
filter(lambda line: "Python" in line) keeps only the lines that contain "Python"
count() returns the number of matching lines
collect() returns all matching lines
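
For comparison, the same filtering can be done in the Scala shell from step 2. The sketch below assumes the same README.md path used above:

val lines = sc.textFile("file:///apps/spark/README.md")            // read the README as an RDD of lines
val pythonLines = lines.filter(line => line.contains("Python"))    // keep only lines containing "Python"
pythonLines.count()      // number of matching lines
pythonLines.collect()    // bring the matching lines back to the driver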

To exit the Python shell, run the following command:

exit()

6. Use the R language shell

Enter the following command in the terminal to start the R shell:

sparkR

After it runs, the display is as follows:

dolphin@tools:~$ sparkR
 
R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
 
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
 
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
 
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
 
Launching java with spark-submit command /apps/spark/bin/spark-submit   "sparkr-shell" /tmp/Rtmpp6WiOW/backend_port31e446162cd0
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/08/07 03:52:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/07 03:52:18 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
 
Welcome to
____              __
/ __/__  ___ _____/ /__
_\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version  2.1.3
/_/
 
 
SparkSession available as 'spark'.
During startup - Warning messages:
1: In Filter(nzchar, unlist(strsplit(input, ",|\\s"))) :
bytecode version mismatch; using eval
2: package 'SparkR' was built under R version 3.4.4
>

The > prompt shown above indicates that you have entered the R shell.

7. Use R to work with a SparkDataFrame

Create a SparkDataFrame:

people <- read.df("/apps/spark/examples/src/main/resources/people.json", "json")

Display the data in the DataFrame:

head(people)

After it runs, the display is as follows:

>head(people)
age    name
1  NA Michael
2  30    Andy
3  19  Justin

Display only a single column:

head(select(people, "name"))

After it runs, the display is as follows:

>head(select(people, "name"))
name
1 Michael
2    Andy
3  Justin

Filter the data:

head(filter(people, people$age >20))

After it runs, the display is as follows:

>head(filter(people, people$age >20))
age name
1  30 Andy
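
For reference, the same DataFrame operations are also available from the Scala shell through the 'spark' session. A minimal sketch, assuming the same people.json example file:

val people = spark.read.json("/apps/spark/examples/src/main/resources/people.json")  // create a DataFrame from the JSON file
people.show()                               // display the rows
people.select("name").show()                // show only the name column
people.filter(people("age") > 20).show()    // show rows with age greater than 20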

*This concludes the lab.*

Reposted from blog.csdn.net/qq_46009608/article/details/110179004