Having just installed Hadoop and Spark, let's test the setup with the classic WordCount application.

First, try PySpark in local mode:
$ pyspark
Once the shell starts, check the master:
>>> sc.master
u'local[*]'
We can see the master is local.
>>> text = sc.textFile("shakespeare.txt")
>>> from operator import add
>>> def token(text):
...     return text.split()
...
>>> words = text.flatMap(token)
>>> wc = words.map(lambda x:(x,1))
>>> counts = wc.reduceByKey(add)
>>> counts.saveAsTextFile('wc')
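To make the three RDD steps concrete, here is the same computation sketched in plain Python with `collections.Counter` (the function name `wordcount` and the sample lines are illustrative, not from Spark):

```python
from collections import Counter

def wordcount(lines):
    """Plain-Python equivalent of the flatMap/map/reduceByKey pipeline."""
    # flatMap(token): split every line into words and flatten into one list
    words = [w for line in lines for w in line.split()]
    # map(lambda x: (x, 1)) followed by reduceByKey(add): count each word
    return Counter(words)

counts = wordcount(["to be or", "not to be"])
print(counts["to"])  # 2
print(counts["be"])  # 2
```

The difference is that Spark evaluates the pipeline lazily and in parallel across partitions; nothing runs until `saveAsTextFile` triggers the job.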
Local mode needs no configuration. If you want to run Spark on Hadoop (YARN), however, you need to configure it. Copy the two template files:
$ cd $SPARK_HOME/conf
$ cp spark-env.sh.template spark-env.sh
$ cp slaves.template slaves
$ vim spark-env.sh
Add the following:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/srv/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_LOCAL_IP=127.0.0.1
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
$ vim slaves
Add the worker host, one hostname per line:
localhost
For the yarn-site.xml configuration, see my previous article on Hadoop configuration; without it YARN may report memory errors.
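If you do hit container memory errors, the yarn-site.xml settings usually involved look like the following. This is only an illustrative fragment with example values — tune them to your machine, and see the previous article for the full file:

```xml
<!-- illustrative yarn-site.xml fragment; values depend on your machine -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
<property>
  <!-- the virtual-memory check often kills small containers on dev boxes -->
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
```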
Now you can start the Hadoop services:
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh
$ jps
5666 DataNode
6089 ResourceManager
5930 SecondaryNameNode
6250 NodeManager
5502 NameNode
6607 Jps
All the daemons are up. The input file was already uploaded to HDFS in a previous article; if you have not done that yet, please refer back to it first.
Start pyspark again, this time on YARN in client deploy mode. In client mode the driver runs in the client process, which suits interactive sessions or jobs where you want results right away. There is also a cluster deploy mode, where the driver runs inside the ApplicationMaster; it suits long-running jobs that need no user interaction.
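As a sketch, the two deploy modes differ only in the flag passed to the launcher (`wordcount.py` here is a hypothetical script containing the code below; these commands need a running cluster, so they are shown for reference only):

```
# interactive / quick feedback: driver runs in this client process
spark-submit --master yarn --deploy-mode client wordcount.py

# long-running, unattended: driver runs inside the ApplicationMaster
spark-submit --master yarn --deploy-mode cluster wordcount.py
```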
pyspark --master yarn --deploy-mode client
>>> sc.master
u'yarn'
Enter the same code as before, but this time we read the file from HDFS, so use the hdfs:// scheme:
>>> text = sc.textFile("hdfs://localhost:9000/user/hadoop/shakespeare.txt")
>>> from operator import add
>>> def token(text):
...     return text.split()
...
>>> words = text.flatMap(token)
>>> wc = words.map(lambda x:(x,1))
>>> counts = wc.reduceByKey(add)
>>> counts.saveAsTextFile("wc")
Let's check the results:
$ hadoop fs -ls /user/hadoop/wc
Found 3 items
-rw-r--r-- 1 hadoop supergroup 0 2020-01-12 18:24 /user/hadoop/wc/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 3074551 2020-01-12 18:24 /user/hadoop/wc/part-00000
-rw-r--r-- 1 hadoop supergroup 3085307 2020-01-12 18:24 /user/hadoop/wc/part-00001
$ hadoop fs -tail wc/part-00000 | less
(u'winterstale@145208', 1)
(u'muchadoaboutnothing@65485', 1)
(u'midsummersnightsdream@99', 1)
(u'hamlet@147754', 1)
(u'tamingoftheshrew@36231', 1)
(u"'ld", 1)
(u'roars', 3)
(u'2kinghenryvi@83454', 1)
(u'kinghenryv@109398', 1)
(u'juliuscaesar@101945', 1)
(u'twogentlemenofverona@2385', 1)
(u'hamlet@107567', 1)
(u'hamlet@7588', 1)
(u"unmaster'd", 1)
(u'kinghenryviii@23697', 1)
(u'lance', 4)
(u'coriolanus@72688', 1)
(u'cymbeline@145819', 1)
(u"serpent's", 8)
(u'asyoulikeit@43360', 1)
(u'antonyandcleopatra@16482', 1)
(u'prolixious', 1)
(u'cymbeline@137535', 1)
(u'loveslabourslost@37179', 1)
(u'antonyandcleopatra@153227', 1)
(u'muchadoaboutnothing@75680', 1)
(u'SCROOP]', 1)
(u'kinghenryviii@9720', 1)
(u'prenez', 1)
(u'garb', 2)
(u'roar!', 1)
(u'Craves', 1)
(u'twelfthnight@27306', 1)
(u'palace!', 1)
(u'tempest@83776', 1)
(u'SELD', 1)
(u"Howsoe'er", 1)
(u'belied,', 1)
(u'twogentlemenofverona@74060', 1)
(u'timonofathens@55279', 1)
(u"warrant's", 1)
(u'vane', 2)
(u'roar;', 1)
(u'MOTH]', 7)
(u'belied.', 1)
(u'roar?', 2)
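Notice that punctuation sticks to the words, so 'roars', 'roar!', 'roar;' and 'roar?' are counted separately. A normalizing tokenizer would fold these together — a sketch, assuming a hypothetical `normalize` helper that is not part of the article's code:

```python
import string

def normalize(word):
    """Lowercase and strip surrounding punctuation (illustrative helper)."""
    return word.strip(string.punctuation).lower()

def token(text):
    # drop tokens that were nothing but punctuation
    return [w for w in (normalize(t) for t in text.split()) if w]

print(token("roar! Roar; roar?"))  # ['roar', 'roar', 'roar']
```

Passing this `token` to `flatMap` instead of the bare `text.split()` would merge those variants into a single count.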
Finally, you can monitor the running application in the YARN web UI:
http://localhost:8088/cluster