1. Download and configure Spark source code
First, download the Spark source code: https://github.com/apache/spark/tree/v2.4.5
Official repository: https://github.com/apache/spark
It is best to compile on a cloud host first, then pull the repository to your local machine and point your local Maven configuration at the same repository cache. Downloading dependencies on Windows can be slow; if you cannot wait, go through a proxy.
You can change the Scala and Hadoop versions in the top-level pom.xml:
<hadoop.version>2.6.0-cdh5.16.2</hadoop.version>
<scala.version>2.12.10</scala.version>
<scala.binary.version>2.12</scala.binary.version>
If necessary, add the CDH repository address to the top-level pom.xml: https://repository.cloudera.com/artifactory/cloudera-repos/
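For reference, the repository entry might look like the following sketch; the `id` value is arbitrary, and the `<repositories>` section may need to be created if it does not already exist:

```xml
<!-- Sketch: lets Maven resolve CDH-flavored Hadoop artifacts such as
     2.6.0-cdh5.16.2. Place inside the <repositories> section of the
     top-level pom.xml. The id below is an arbitrary label. -->
<repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
```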
2. Compile Spark source code
Before compiling the Spark source code, a few dependency scopes need to be changed: several dependencies are declared with <scope>provided</scope>, which causes a ClassNotFoundException when running from the IDE.
- Modify the pom.xml file under the hive-thriftserver module
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-server</artifactId>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-servlet</artifactId>
<!-- <scope>provided</scope>-->
</dependency>
- Modify the top-level pom.xml file
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-http</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-continuation</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-servlet</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-servlets</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-proxy</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-client</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-util</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-security</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-plus</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-server</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
Change the scope of the following dependencies to compile:
<dependency>
<groupId>xml-apis</groupId>
<artifactId>xml-apis</artifactId>
<version>1.4.01</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<scope>compile</scope>
</dependency>
If you encounter other ClassNotFoundExceptions caused by the same reason, comment out the provided scope of the offending dependency in the same way.
Compile with git-bash using the command: mvn clean package -DskipTests=true
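A fuller invocation is sketched below; the profile names assume the standard Spark 2.4 build profiles, and the Hadoop version should match your cluster:

```shell
# Sketch: -Phive and -Phive-thriftserver enable the Hive and Thrift-server
# modules; -Dhadoop.version overrides the Hadoop version declared in the
# top-level pom.xml. Adjust both to your environment.
mvn clean package -DskipTests=true \
    -Phive -Phive-thriftserver \
    -Dhadoop.version=2.6.0-cdh5.16.2
```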
3. Import the source code into IDEA
The project uses Maven; after importing it into IDEA, wait for dependency resolution to complete.
Before building, delete the streaming package under the spark-sql module's test sources; otherwise Build Project will descend into it and fail with java.lang.OutOfMemoryError: GC overhead limit exceeded. Then click Build Project to compile.
After a successful build, you can debug SparkSQL.
4. Debug SparkSQL locally
Find the hive-thriftserver module, create a new resources directory under main, and mark it as a resources root.
Copy the hive-site.xml below from the cluster into the resources directory:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://hadoop:9083</value>
<description>Points to the host running the metastore service</description>
</property>
</configuration>
Note: only hive-site.xml is needed here.
The metastore service must be running on the server:
hive --service metastore &
Run SparkSQLCLIDriver.
Before running, add the following parameters to VM options:
-Dspark.master=local[2] -Djline.WindowsTerminal.directConsole=false
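Putting it together, the IDEA run configuration looks roughly like this (a sketch; the main class is the CLI driver in the hive-thriftserver module, and the exact module name in your project may differ):

```
Run/Debug Configuration (Application)
  Main class:  org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
  VM options:  -Dspark.master=local[2] -Djline.WindowsTerminal.directConsole=false
  Module:      spark-hive-thriftserver (use the classpath of this module)
```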
spark-sql (default)> show databases;
databaseName
company
default
hive_function_analyze
skewtest
Time taken: 0.028 seconds, Fetched 10 row(s)
spark-sql (default)> select * from score;
id    name    subject
1     tom     ["HuaXue","Physical","Math","Chinese"]
2     jack    ["HuaXue","Animal","Computer","Java"]
3     john    ["ZheXue","ZhengZhi","SiXiu","history"]
4     alice   ["C++","Linux","Hadoop","Flink"]
INFO SparkSQLCLIDriver: Time taken: 1.188 seconds, Fetched 4 row(s)
spark-sql (default)>
Summary:
Download the Spark source code and compile it on the command line first; only import it into IDEA after the command-line build succeeds, then run Build Project. If a ClassNotFoundException is reported at this point, it is usually caused by a dependency whose scope is provided; comment out the scope as shown above (running Generate Sources can also help). Finally, verify the setup with the test queries above.