Big Data (2): Getting Started with Spark

1. Preparation

For the environment setup, see the earlier post "Building a Spark Processing Framework [VM15 + CentOS7 + Hadoop2.7.2 + Scala2.12.7 + Spark2.3.1]".

2. Running the Examples

The Spark example programs live under ./examples/src/main and come in several language versions: Scala, Java, Python, and R.

2.1 Scala Example

Since Spark itself is written in Scala, the Scala examples are the most straightforward to run.

[root@master hadoop-2.7.2]# cd /opt/spark/spark-2.3.1-bin-hadoop2.7
[root@master spark-2.3.1-bin-hadoop2.7]# ./bin/run-example SparkKMeans

2.2 Python Example

Python, currently one of the most popular languages for data work, of course deserves a mention as well. Note: R example files are submitted in the same way; the early standalone SparkR distribution no longer applies to recent versions.

[root@master hadoop-2.7.2]# cd /opt/spark/spark-2.3.1-bin-hadoop2.7
[root@master spark-2.3.1-bin-hadoop2.7]# ./bin/spark-submit examples/src/main/python/sort.py

3. Interacting with the Spark Shell

3.1 PySpark

  • CentOS 7 ships with Python 2.7 and does not preinstall pip, yet examples such as the MLlib ones need third-party libraries, so install pip and the libraries first.
# First install the EPEL repository
[root@master ~]# yum -y install epel-release
# Install pip
[root@master ~]# yum -y install python-pip
# Clean the yum cache
[root@master ~]# yum clean all
# Install the third-party library numpy
[root@master ~]# pip install numpy
Collecting numpy
  Downloading https://files.pythonhosted.org/packages/9e/eb/c9eda9f4865d669e0bb37ce5c780e86c63daa54ca827b95a171429012d08/numpy-1.15.3-cp27-cp27mu-manylinux1_x86_64.whl (13.8MB)
    100% |████████████████████████████████| 13.8MB 17kB/s 
Installing collected packages: numpy
Successfully installed numpy-1.15.3
You are using pip version 8.1.2, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
# Upgrade pip
[root@master ~]# pip install --upgrade pip
Collecting pip
  Downloading https://files.pythonhosted.org/packages/c2/d7/90f34cb0d83a6c5631cf71dfe64cc1054598c843a92b400e55675cc2ac37/pip-18.1-py2.py3-none-any.whl (1.3MB)
    100% |████████████████████████████████| 1.3MB 45kB/s 
Installing collected packages: pip
  Found existing installation: pip 8.1.2
    Uninstalling pip-8.1.2:
      Successfully uninstalled pip-8.1.2
Successfully installed pip-18.1

Bug: while installing the matplotlib module, the install failed with "Cannot uninstall 'pyparsing'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall".
Fix: ignore the error and force the install with sudo pip install --ignore-installed matplotlib; afterwards pip list shows matplotlib installed successfully.

  • Using the Spark Python API, create an RDD and run the two different kinds of operations on it: transformations and actions (a short sketch after the interactive session below illustrates the difference).
cd /opt/spark/spark-2.3.1-bin-hadoop2.7
./bin/pyspark      # start the PySpark interactive shell

>>> lines=sc.textFile("README.md")
>>> lines.count()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/rdd.py", line 1073, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/rdd.py", line 1064, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/rdd.py", line 935, in fold
    vals = self.mapPartitions(func).collect()
  File "/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/rdd.py", line 834, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/opt/spark/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/spark/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://master:9000/user/root/README.md
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:54)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:162)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Because pyspark was started from /opt/spark/spark-2.3.1-bin-hadoop2.7 and README.md was then read with a relative path, the count() call fails with:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://master:9000/user/root/README.md

This happens because a relative path is resolved against Hadoop's default filesystem, in this case hdfs://master:9000/user/root/, and README.md does not exist there. Either upload the file to that HDFS directory first, or give sc.textFile() an absolute local path with the file:// scheme, as below:

>>> lines = sc.textFile("file:///opt/spark/spark-2.3.1-bin-hadoop2.7/README.md")
>>> lines.count()            # count the lines
103                                                                             
>>> lines.first()            # first line of the file
u'# Apache Spark'
>>> lines.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a>b) else b)
22                           # anonymous (lambda) functions
>>> wordCounts = lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)                 
                             # MapReduce-style implementation
>>> wordCounts.collect()
[(u'project.', 1), (u'help', 1), (u'when', 1), (u'Hadoop', 3), (u'MLlib', 1), (u'"local"', 1), (u'./dev/run-tests', 1), (u'including', 4), (u'graph', 1), (u'computation', 1), (u'file', 1), (u'high-level', 1), (u'find', 1), (u'web', 1), (u'Shell', 2), (u'cluster', 2), (u'also', 4), (u'using:', 1), (u'Big', 1), (u'guidance', 2), (u'run:', 1), (u'Scala,', 1), (u'Running', 1), (u'should', 2), (u'environment', 1), (u'to', 17), (u'only', 1), (u'module,', 1), (u'given.', 1), (u'rich', 1), (u'directory.', 1), (u'Apache', 1), (u'Interactive', 2), (u'sc.parallelize(range(1000)).count()', 1), (u'Building', 1), (u'do', 2), (u'guide,', 1), (u'return', 2), (u'which', 2), (u'than', 1), (u'Programs', 1), (u'Many', 1), (u'Try', 1), (u'built,', 1), (u'YARN,', 1), (u'R,', 1), (u'using', 5), (u'Example', 1), (u'scala>', 1), (u'Once', 1), (u'-DskipTests', 1), (u'Spark"](http://spark.apache.org/docs/latest/building-spark.html).', 1), (u'and', 9), (u'Because', 1), (u'cluster.', 1), (u'name', 1), (u'-T', 1), (u'Testing', 1), (u'optimized', 1), (u'Streaming', 1), (u'./bin/pyspark', 1), (u'SQL', 2), (u'through', 1), (u'GraphX', 1), (u'them,', 1), (u'guide](http://spark.apache.org/contributing.html)', 1), (u'[run', 1), (u'analysis.', 1), (u'development', 1), (u'abbreviated', 1), (u'set', 2), (u'For', 3), (u'Scala', 2), (u'##', 9), (u'the', 24), (u'thread,', 1), (u'library', 1), (u'see', 3), (u'individual', 1), (u'examples', 2), (u'MASTER', 1), (u'runs.', 1), (u'[Apache', 1), (u'Pi', 1), (u'instructions.', 1), (u'More', 1), (u'Python,', 2), (u'#', 1), (u'processing,', 1), (u'for', 12), (u'several', 1), (u'review', 1), (u'its', 1), (u'contributing', 1), (u'This', 2), (u'Developer', 1), (u'version', 1), (u'provides', 1), (u'print', 1), (u'get', 1), (u'Configuration', 1), (u'supports', 2), (u'command,', 2), (u'[params]`.', 1), (u'refer', 2), (u'available', 1), (u'be', 2), (u'Guide](http://spark.apache.org/docs/latest/configuration.html)', 1), (u'run', 7), (u'./bin/run-example', 2), (u'Versions', 1), (u'["Parallel', 1), (u'Hadoop,', 2), (u'Documentation', 1), (u'use', 3), (u'downloaded', 1), (u'distributions.', 1), (u'Spark.', 1), (u'example:', 1), (u'by', 1), (u'package.', 1), (u'Maven](http://maven.apache.org/).', 1), (u'["Building', 1), (u'thread', 1), (u'package', 1), (u'of', 5), (u'changed', 1), (u'programming', 1), (u'Spark', 16), (u'against', 1), (u'site,', 1), (u'Maven,', 1), (u'3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3).', 1), (u'or', 3), (u'comes', 1), (u'first', 1), (u'info', 1), (u'contains', 1), (u'can', 7), (u'overview', 1), (u'package.)', 1), (u'Please', 4), (u'one', 3), (u'Contributing', 1), (u'(You', 1), (u'Online', 1), (u'tools', 1), (u'your', 1), (u'page](http://spark.apache.org/documentation.html).', 1), (u'threads.', 1), (u'Tests', 1), (u'fast', 1), (u'from', 1), (u'[project', 1), (u'APIs', 1), (u'>>>', 1), (u'SparkPi', 2), (u'locally', 2), (u'system', 1), (u'submit', 1), (u'`examples`', 2), (u'systems.', 1), (u'start', 1), (u'IDE,', 1), (u'params', 1), (u'build/mvn', 1), (u'way', 1), (u'basic', 1), (u'README', 1), (u'<http://spark.apache.org/>', 1), (u'It', 2), (u'graphs', 1), (u'more', 1), (u'engine', 1), (u'project', 1), (u'option', 1), (u'on', 7), (u'started', 1), (u'Note', 1), (u'N', 1), (u'usage', 1), (u'versions', 1), (u'DataFrames,', 1), (u'particular', 2), (u'instance:', 1), (u'./bin/spark-shell', 1), (u'general', 3), (u'with', 4), (u'easiest', 1), (u'protocols', 1), (u'must', 1), (u'And', 1), (u'builds', 1), (u'developing', 1), (u'this', 1), (u'setup', 
1), (u'shell:', 2), (u'will', 1), (u'`./bin/run-example', 1), (u'following', 2), (u'Hadoop-supported', 1), (u'distribution', 1), (u'Maven', 1), (u'example', 3), (u'are', 1), (u'detailed', 2), (u'Data.', 1), (u'mesos://', 1), (u'stream', 1), (u'computing', 1), (u'URL,', 1), (u'is', 6), (u'in', 6), (u'higher-level', 1), (u'tests', 2), (u'1000:', 2), (u'an', 4), (u'sample', 1), (u'To', 2), (u'tests](http://spark.apache.org/developer-tools.html#individual-tests).', 1), (u'tips,', 1), (u'at', 2), (u'have', 1), (u'1000).count()', 1), (u'["Specifying', 1), (u'[building', 1), (u'You', 4), (u'configure', 1), (u'information', 1), (u'different', 1), (u'Tools"](http://spark.apache.org/developer-tools.html).', 1), (u'MASTER=spark://host:7077', 1), (u'no', 1), (u'not', 1), (u'Java,', 1), (u'that', 2), (u'storage', 1), (u'documentation,', 1), (u'same', 1), (u'machine', 1), (u'how', 3), (u'need', 1), (u'other', 1), (u'build', 4), (u'prefer', 1), (u'online', 1), (u'you', 4), (u'if', 4), (u'[Contribution', 1), (u'A', 1), (u'About', 1), (u'HDFS', 1), (u'[Configuration', 1), (u'sc.parallelize(1', 1), (u'locally.', 1), (u'Hive', 2), (u'["Useful', 1), (u'running', 1), (u'uses', 1), (u'a', 8), (u'Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version)', 1), (u'variable', 1), (u'The', 1), (u'data', 1), (u'class', 2), (u'built', 1), (u'building', 2), (u'"yarn"', 1), (u'Python', 2), (u'Thriftserver', 1), (u'processing.', 1), (u'programs', 2), (u'requires', 1), (u'documentation', 3), (u'pre-built', 1), (u'Alternatively,', 1), (u'programs,', 1), (u'"local[N]"', 1), (u'Spark](#building-spark).', 1), (u'clean', 1), (u'<class>', 1), (u'spark://', 1), (u'learning,', 1), (u'core', 1), (u'talk', 1), (u'latest', 1)]
                             # collect the computed results
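
To make the two kinds of operations from the bullet above concrete, here is a minimal sketch using the same lines RDD (sparkLines is an illustrative name, not part of the original session). Transformations such as filter() are lazy and only record lineage; actions such as count() and first() are what actually trigger a job:

# transformation: lazy, nothing is computed yet
sparkLines = lines.filter(lambda line: "Spark" in line)
# mark the filtered RDD for reuse across several actions
sparkLines.cache()
# actions: each of these submits a job and returns a result to the driver
print(sparkLines.count())
print(sparkLines.first())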

4. Running Standalone Programs

Java programs are compiled and packaged with Maven, Scala programs with sbt, and Python programs are submitted directly via spark-submit.
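
For the Python route, a minimal standalone application might look like the sketch below. SimpleApp.py is a hypothetical file name, and the file:// path assumes the same installation directory used earlier:

# SimpleApp.py - a minimal standalone PySpark application
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("SimpleApp")
sc = SparkContext(conf=conf)

# read the bundled README.md from the local filesystem (absolute path, as discussed in section 3)
lines = sc.textFile("file:///opt/spark/spark-2.3.1-bin-hadoop2.7/README.md")

# count how many lines mention Spark and how many mention Python
numSpark = lines.filter(lambda line: "Spark" in line).count()
numPython = lines.filter(lambda line: "Python" in line).count()

print("Lines with Spark: %d, lines with Python: %d" % (numSpark, numPython))

sc.stop()

Submit it from the installation directory with ./bin/spark-submit SimpleApp.py; unlike the Java and Scala cases, no packaging step is required.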



Reposted from blog.csdn.net/sinat_36369024/article/details/83155379