Running Spark locally and on Hadoop (YARN)

Having just installed Hadoop and Spark and being eager to try them out, let's use the classic WordCount as a small test application.

First, let's experiment with pyspark in local mode:

$ pyspark

Once the shell has started:

>>> sc.master
u'local[*]'

We can see that the master is local.

>>> text = sc.textFile("shakespeare.txt")
>>> from operator import add
>>> def token(text):
...     return text.split()
... 
>>> words = text.flatMap(token)
>>> wc = words.map(lambda x:(x,1))
>>> counts = wc.reduceByKey(add)
>>> counts.saveAsTextFile('wc')
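
Since we are in local mode, saveAsTextFile('wc') writes to a directory named wc on the local filesystem, with one part file per partition. A quick sanity check (the part file name below is illustrative and depends on how many partitions the RDD had):

$ ls wc/
$ head wc/part-00000

Each line of a part file is one (word, count) tuple printed as a string.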

That runs without any extra configuration. If you want to run Spark on Hadoop (YARN), however, you need to configure it first by copying two template files:

$ cd $SPARK_HOME/conf
$ cp spark-env.sh.template spark-env.sh
$ cp slaves.template slaves
$ vim spark-env.sh

Add the following:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/srv/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_LOCAL_IP=127.0.0.1
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
$ vim slaves

Add the following line:

localhost

For the yarn-site.xml configuration, see my earlier article on configuring Hadoop; without it you may run into memory errors when YARN launches containers.
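
If you just want a single-machine setup that works for this test, a minimal yarn-site.xml along the following lines is common; the memory values are illustrative and should be sized to your machine, and disabling the virtual-memory check is a frequent workaround for containers being killed:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- illustrative sizes; adjust to the memory actually available -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>4096</value>
  </property>
  <!-- avoids "running beyond virtual memory limits" container kills -->
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
</configuration>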
Now you can start the Hadoop services:

$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh
$ jps
5666 DataNode
6089 ResourceManager
5930 SecondaryNameNode
6250 NodeManager
5502 NameNode
6607 Jps

Everything is up.
The data file was already uploaded to HDFS in a previous article; if you haven't done that yet, please refer back to it.
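
For reference, the upload looks roughly like this (assuming the hadoop user and shakespeare.txt sitting in the current directory):

$ hadoop fs -mkdir -p /user/hadoop
$ hadoop fs -put shakespeare.txt /user/hadoop/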
Now we start pyspark again, this time on YARN in client deploy mode. In client mode the driver runs in the client process, which suits interactive use and jobs where you want results back right away. There is also a cluster deploy mode, meant for long-running jobs with no user interaction, in which the driver runs inside the ApplicationMaster (a spark-submit sketch of that mode is shown at the end of this post).

pyspark --master yarn --deploy-mode client

>>> sc.master
u'yarn'

The code below is the same as before, but this time we read the file from HDFS, so we use the hdfs:// scheme:

>>> text = sc.textFile("hdfs://localhost:9000/user/hadoop/shakespeare.txt")
>>> from operator import add
>>> def token(text):
...     return text.split()
... 
>>> words = text.flatMap(token)
>>> wc = words.map(lambda x:(x,1))
>>> counts = wc.reduceByKey(add)
>>> counts.saveAsTextFile("wc")

Let's check the results:

$ hadoop fs -ls /user/hadoop/wc
Found 3 items
-rw-r--r--   1 hadoop supergroup          0 2020-01-12 18:24 /user/hadoop/wc/_SUCCESS
-rw-r--r--   1 hadoop supergroup    3074551 2020-01-12 18:24 /user/hadoop/wc/part-00000
-rw-r--r--   1 hadoop supergroup    3085307 2020-01-12 18:24 /user/hadoop/wc/part-00001

$ hadoop fs -tail wc/part-00000 | less
u'winterstale@145208', 1)
(u'muchadoaboutnothing@65485', 1)
(u'midsummersnightsdream@99', 1)
(u'hamlet@147754', 1)
(u'tamingoftheshrew@36231', 1)
(u"'ld", 1)
(u'roars', 3)
(u'2kinghenryvi@83454', 1)
(u'kinghenryv@109398', 1)
(u'juliuscaesar@101945', 1)
(u'twogentlemenofverona@2385', 1)
(u'hamlet@107567', 1)
(u'hamlet@7588', 1)
(u"unmaster'd", 1)
(u'kinghenryviii@23697', 1)
(u'lance', 4)
(u'coriolanus@72688', 1)
(u'cymbeline@145819', 1)
(u"serpent's", 8)
(u'asyoulikeit@43360', 1)
(u'antonyandcleopatra@16482', 1)
(u'prolixious', 1)
(u'cymbeline@137535', 1)
(u'loveslabourslost@37179', 1)
(u'antonyandcleopatra@153227', 1)
(u'muchadoaboutnothing@75680', 1)
(u'SCROOP]', 1)
(u'kinghenryviii@9720', 1)
(u'prenez', 1)
(u'garb', 2)
(u'roar!', 1)
(u'Craves', 1)
(u'twelfthnight@27306', 1)
(u'palace!', 1)
(u'tempest@83776', 1)
(u'SELD', 1)
(u"Howsoe'er", 1)
(u'belied,', 1)
(u'twogentlemenofverona@74060', 1)
(u'timonofathens@55279', 1)
(u"warrant's", 1)
(u'vane', 2)
(u'roar;', 1)
(u'MOTH]', 7)
(u'belied.', 1)
(u'roar?', 2)

Finally, you can watch the application's execution on the YARN web UI at http://localhost:8088/cluster
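
As for the yarn-cluster deploy mode mentioned earlier: the interactive shell cannot run in cluster mode, so you would package the same logic into a script and launch it with spark-submit. A minimal sketch (the file name wordcount.py and the output path are placeholders I chose for illustration):

# wordcount.py -- the same WordCount logic, packaged for spark-submit
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="wordcount")
    text = sc.textFile("hdfs://localhost:9000/user/hadoop/shakespeare.txt")
    counts = (text.flatMap(lambda line: line.split())
                  .map(lambda word: (word, 1))
                  .reduceByKey(add))
    counts.saveAsTextFile("hdfs://localhost:9000/user/hadoop/wc_cluster")
    sc.stop()

$ spark-submit --master yarn --deploy-mode cluster wordcount.py

In cluster mode the driver output goes to the ApplicationMaster's logs, so you follow progress through the YARN UI or with yarn logs -applicationId <id> rather than in your terminal.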


Origin blog.csdn.net/yao09605/article/details/103949194