This article walks through compiling Spark 2.1.0 from source.
1. Build environment:
JDK 1.8 or above
Hadoop 2.7.3
Scala 2.10.4
Requirements:
Maven 3.3.9 or above (important)
Download:
http://mirror.bit.edu.cn/apache/maven/maven-3/3.5.2/binaries/apache-maven-3.5.2-bin.tar.gz
Edit conf/settings.xml under the Maven installation directory and add the following inside the <mirrors> element (an Aliyun mirror speeds up downloads considerably in China):
<mirror>
  <id>alimaven</id>
  <name>aliyun maven</name>
  <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
  <mirrorOf>central</mirrorOf>
</mirror>
2. Download the source from http://spark.apache.org
2.1 Download spark-2.1.0.tgz
2.2 Unzip
tar -zxvf spark-2.1.0.tgz
3. Enter the source root, modify the build script, and compile
Edit make-distribution.sh in the spark-2.1.0/dev directory and comment out the block that detects the versions through Maven, hardcoding the values instead; this saves time on every build
vi make-distribution.sh
Tip:
In this file the tar invocation is written as czf without the leading "-"; add it yourself (tar -czf).
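The version-detection block mentioned above can be replaced with hardcoded values; a minimal sketch follows (the exact values match this tutorial's build and are assumptions for yours):

```shell
# In dev/make-distribution.sh, comment out the VERSION/SCALA_VERSION/
# SPARK_HADOOP_VERSION/SPARK_HIVE lines that invoke "$MVN" help:evaluate,
# and hardcode the values instead:
VERSION=2.1.0               # Spark version being built
SCALA_VERSION=2.11          # Scala binary version used by Spark 2.1.0
SPARK_HADOOP_VERSION=2.7.3  # must match -Dhadoop.version below
SPARK_HIVE=1                # 1 bundles Hive support into the distribution
```

Skipping the Maven evaluation also avoids its [info] output leaking into these variables.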
Note:
If you use a CDH build of Hadoop, you also need to edit pom.xml in the Spark root directory and add the Cloudera repository:
<repository>
  <id>cloudera</id>
  <name>cloudera Repository</name>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
Add it inside the <repositories></repositories> element.
3.1 Increase Maven's available memory
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
3.2 Compile
./dev/make-distribution.sh \
  --name 2.7.3 \
  --tgz \
  -Pyarn \
  -Phadoop-2.7 \
  -Dhadoop.version=2.7.3 \
  -Phive -Phive-thriftserver \
  -DskipTests clean package
Then just wait; the first build can take a very long time, from a few hours up to ten or more depending on network speed, since many dependencies have to be downloaded.
Command explanation:
--name 2.7.3 *** name suffix for the built distribution
--tgz *** package the result as a .tgz archive
-Pyarn *** enable YARN support
-Phadoop-2.7 -Dhadoop.version=2.7.3 *** build against Hadoop 2.7.3
-Phive -Phive-thriftserver *** enable Hive and the Thrift JDBC/ODBC server
-DskipTests clean package *** skip the tests while packaging
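For reference, the final tarball's name is derived from the Spark version and the --name flag; a small sketch of the naming scheme make-distribution.sh uses (spark-$VERSION-bin-$NAME.tgz, with this tutorial's values):

```shell
# How the distribution artifact gets its name:
VERSION=2.1.0   # Spark version
NAME=2.7.3      # value passed via --name
TARBALL="spark-${VERSION}-bin-${NAME}.tgz"
echo "$TARBALL"   # -> spark-2.1.0-bin-2.7.3.tgz
```

This is also why a polluted VERSION variable (see Error 2 below) produces a garbled tarball name.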
That wraps up the Spark build.
Here are some of the problems I ran into while compiling.
Error 1 :
Failed to execute goal on project spark-launcher_2.11:
Could not resolve dependencies for project org.apache.spark:spark-launcher_2.11:jar:2.1.0:
Failure to find org.apache.hadoop:hadoop-client:jar:hadoop2.7.3 in https://repo1.maven.org/maven2 was cached in the local repository,
resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]
Solution: this error usually means a parameter in the build command was mistyped (note the malformed version string hadoop2.7.3 above). Hopefully you won't hit it.
Error 2:
+ tar czf 'spark-[info] Compile success at Nov 28, 2017 11:27:10 AM [20.248s]-bin-2.7.3.tgz' -C /zhenglh/new-spark-build/spark-2.1.0 'spark-[info] Compile success at Nov 28, 2017 11:27:10 AM [20.248s]-bin-2.7.3'
tar (child): Cannot connect to spark-[info] Compile success at Nov 28, 2017 11: resolve failed
The build succeeded but the result was not packaged; the name was polluted by Maven's [info] output:
spark-[info] Compile success at Nov 28, 2017 11:27:10 AM [20.248s]-bin-2.7.3
Almost everyone hits this error on a first build.
Solution: see the tip in section 3 above; fix the tar flags and hardcode the versions in make-distribution.sh so that Maven's [info] output does not leak into the distribution name.