Spark Series - Experiment 1: Spark Shell Basics

Shell operations for Scala, Python, and R in Spark
Experimental environment
Linux Ubuntu 16.04
Prerequisites:

  1. The Java runtime environment has been deployed
  2. The R language runtime environment has been deployed
  3. Spark has been deployed in Local mode

We have prepared the above prerequisites for you.

Experimental content
With the above prerequisites in place, complete the shell operations for Scala, Python, and R in Spark.

Experimental steps
1. Click "Command Line Terminal" to open a new window

2. Start Scala Shell

Scala is Spark's default language. Enter the following command in the command line terminal to start the Scala shell:

spark-shell

After startup, the terminal displays as follows:

dolphin@tools:~$ spark-shell 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/08/06 08:17:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://172.21.0.2:4040
Spark context available as 'sc' (master = local[*], app id = local-1533543446142).
Spark session available as 'spark'.
Welcome to
____              __
/ __/__  ___ _____/ /__
_\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 2.1.3
/_/
 
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161)
Type in expressions to have them evaluated.
Type :help for more information.
 
scala>

The scala> prompt shown above indicates that you have entered the Scala shell.
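As the startup log notes, you can adjust the logging level from inside the shell. For example, to hide the WARN messages (the level "ERROR" here is just one valid choice):

// raise the log threshold so only errors are printed (sc already exists in spark-shell)
sc.setLogLevel("ERROR")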

3. Use the Scala shell to complete the word-count example

Enter the following Scala statement in the Scala shell:

sc.textFile("file:///home/dolphin/words.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("file:///home/dolphin/output")

After execution, the display is as follows:

scala>sc.textFile("file:///home/dolphin/words.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("file:///home/dolphin/output")
 
scala>

At this point, the result files have been generated in the /home/dolphin/output directory. Next, enter the following command to exit the Scala shell:

:quit

You are now back at the command line terminal.
Enter the following command to view the result:

cat ~/output/part-*

The following output indicates that the word-count example is complete.

dolphin@tools:~$ cat ~/output/part-*
(are,2)
(am,1)
(how,1)
(dolphin,2)
(hello,2)
(what,1)
(now,1)
(world,1)
(you,2)
(i,1)
(words.txt,1)
(doing,1)

Description of the code:
sc is the SparkContext object, the entry point of the submitted Spark program
textFile("file:///home/dolphin/words.txt") reads the data from the local file system
flatMap(_.split(" ")) splits each line on spaces and flattens the pieces into a single collection of words
map((_,1)) turns each word into a tuple of the word and 1
reduceByKey(_+_) reduces by key, accumulating the values for each word
saveAsTextFile("file:///home/dolphin/output") writes the result to the local file system
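To see what each stage produces, the same pipeline can be typed into the Scala shell one transformation at a time and inspected with collect(). This is only an illustrative sketch, assuming the same words.txt as above:

// read the file as an RDD of lines (sc already exists in spark-shell)
val lines = sc.textFile("file:///home/dolphin/words.txt")

// split every line on spaces and flatten into a single RDD of words
val words = lines.flatMap(_.split(" "))

// pair each word with the count 1
val pairs = words.map((_, 1))

// sum the counts for each distinct word
val counts = pairs.reduceByKey(_ + _)

// bring the result back to the driver and print it
counts.collect().foreach(println)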

4. Start Python Shell

Enter the following command in the command line terminal to start the Python Shell

pyspark

After startup, the display is as follows

dolphin@tools:~$ pyspark 
Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/08/07 02:40:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/07 02:40:41 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
____              __
/ __/__  ___ _____/ /__
_\ \/ _ \/ _ `/ __/  '_/
/__ / .__/\_,_/_/ /_/\_\   version 2.1.3
/_/
 
Using Python version 2.7.12 (default, Dec  4 2017 14:50:18)
SparkSession available as 'spark'.
>>>

The >>> prompt shown above indicates that you have entered the Python shell.

5. Use the Python shell to implement a word-filtering example

Execute the following code at the Python prompt:

lines = sc.textFile("file:///apps/spark/README.md")
pythonLines = lines.filter(lambda line: "Python" in line)
pythonLines.count()
pythonLines.collect()

After execution, the display is as follows:

>>>lines = sc.textFile("file:///apps/spark/README.md")
>>>pythonLines = lines.filter(lambda line: "Python" in line)
>>>pythonLines.count()
3
>>>pythonLines.collect()
[u'high-level APIs in Scala, Java, Python, and R, and an optimized engine that', u'## Interactive Python Shell', u'Alternatively, if you prefer Python, you can use the Python shell:']

Description of the code:
sc is the SparkContext object, the entry point of the submitted Spark program.
textFile("file:///apps/spark/README.md") reads the data from the local file system
filter(lambda line: "Python" in line) keeps only the lines that contain "Python"
count() returns the number of matching lines
collect() returns all matching lines as a list
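For comparison, roughly the same filtering can be written in the Scala shell from step 3. This is only a sketch of an equivalent, not part of the original experiment:

// read the README as an RDD of lines and keep only lines containing "Python"
val pythonLines = sc.textFile("file:///apps/spark/README.md").filter(_.contains("Python"))

// count the matching lines, then fetch them back to the driver
pythonLines.count()
pythonLines.collect()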

To exit the Python shell, execute the following command:

exit()

6. Start the R shell (SparkR)

Enter the following command in the command line terminal to start the SparkR shell:

sparkR

After execution, the display is as follows:

dolphin@tools:~$ sparkR
 
R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
 
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
 
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
 
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
 
Launching java with spark-submit command /apps/spark/bin/spark-submit   "sparkr-shell" /tmp/Rtmpp6WiOW/backend_port31e446162cd0
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/08/07 03:52:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/07 03:52:18 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
 
Welcome to
____              __
/ __/__  ___ _____/ /__
_\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version  2.1.3
/_/
 
 
SparkSession available as 'spark'.
During startup - Warning messages:
1: In Filter(nzchar, unlist(strsplit(input, ",|\\s"))) :
bytecode version mismatch; using eval
2: package 'SparkR' was built under R version 3.4.4
>

The > prompt shown above indicates that you have entered the R shell.

7. Use R to manipulate a SparkDataFrame

Create a SparkDataFrame:

people <- read.df("/apps/spark/examples/src/main/resources/people.json", "json")

Show the data in the DataFrame:

head(people)

After execution, the display is as follows

>head(people)
age    name
1  NA Michael
2  30    Andy
3  19  Justin

Show only one column of data

head(select(people, "name"))

After execution, the display is as follows

>head(select(people, "name"))
name
1 Michael
2    Andy
3  Justin

Filter data

head(filter(people, people$age >20))

After execution, the display is as follows

>head(filter(people, people$age >20))
age name
1  30 Andy
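These DataFrame operations can also be tried from the Scala shell. The following is only a rough Scala sketch of the same steps, assuming the same people.json shipped with Spark:

// spark (a SparkSession) is predefined in spark-shell
val people = spark.read.json("file:///apps/spark/examples/src/main/resources/people.json")

// show the whole DataFrame
people.show()

// show only the name column
people.select("name").show()

// keep only the rows where age is greater than 20
people.filter(people("age") > 20).show()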

*This concludes the experiment.*
