Importing and Compiling the Spark Source Code in IDEA

1. Download and configure Spark source code

First, download the Spark source code: https://github.com/apache/spark/tree/v2.4.5
Official repository: https://github.com/apache/spark

It is best to compile on a cloud host first and then pull the repository to your local machine, configuring your local Maven installation and repository path accordingly. Downloading on Windows can be slow; if you cannot wait, use a proxy.

You can modify the Scala version and Hadoop version in the main pom.xml:

<hadoop.version>2.6.0-cdh5.16.2</hadoop.version>
<scala.version>2.12.10</scala.version>
<scala.binary.version>2.12</scala.binary.version>

If necessary, add the Cloudera (CDH) repository https://repository.cloudera.com/artifactory/cloudera-repos/ to the main pom.xml.
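For reference, the repository can be declared like this inside the existing <repositories> section of the main pom.xml (the id and name values here are arbitrary labels, not required by Maven):

```xml
<repository>
  <id>cloudera</id>
  <name>Cloudera Repository</name>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
```

With this in place, Maven can resolve the CDH-specific Hadoop artifacts such as 2.6.0-cdh5.16.2.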

2. Compile Spark source code

Before compiling the Spark source code, a few things need to be changed: dependencies whose scope is provided are not placed on the runtime classpath, so running from the IDE would fail with ClassNotFoundException.

  • Modify the pom.xml file under the hive-thriftserver module
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-server</artifactId>
    <!--      <scope>provided</scope>-->
</dependency>
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-servlet</artifactId>
    <!--      <scope>provided</scope>-->
</dependency>

Modify the main pom.xml file

<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-http</artifactId>
    <version>${jetty.version}</version>
    <!--        <scope>provided</scope>-->
</dependency>
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-continuation</artifactId>
    <version>${jetty.version}</version>
    <!--        <scope>provided</scope>-->
</dependency>
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-servlet</artifactId>
    <version>${jetty.version}</version>
    <!--        <scope>provided</scope>-->
</dependency>
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-servlets</artifactId>
    <version>${jetty.version}</version>
    <!--        <scope>provided</scope>-->
</dependency>
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-proxy</artifactId>
    <version>${jetty.version}</version>
    <!--        <scope>provided</scope>-->
</dependency>
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-client</artifactId>
    <version>${jetty.version}</version>
    <!--        <scope>provided</scope>-->
</dependency>
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-util</artifactId>
    <version>${jetty.version}</version>
    <!--        <scope>provided</scope>-->
</dependency>
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-security</artifactId>
    <version>${jetty.version}</version>
    <!--       <scope>provided</scope>-->
</dependency>
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-plus</artifactId>
    <version>${jetty.version}</version>
    <!--  <scope>provided</scope>-->
</dependency>
<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-server</artifactId>
    <version>${jetty.version}</version>
    <!--        <scope>provided</scope>-->
</dependency>

Change the scope of the following dependencies to compile:
<dependency>
  <groupId>xml-apis</groupId>
  <artifactId>xml-apis</artifactId>
  <version>1.4.01</version>
  <scope>compile</scope>
</dependency>

<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <scope>compile</scope>
</dependency>

If other similar ClassNotFoundExceptions occur for the same reason, comment out the corresponding provided scope in the same way.

Compile in Git Bash with the command mvn clean package -DskipTests=true
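A fuller invocation might look like the following. The memory setting is a common workaround for Scala compiler OOMs, and the build profiles follow the standard Spark build; which profiles you actually need depends on your setup:

```shell
# Give the Maven JVM enough memory for the Scala compiler
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

# Build without tests; enable Hive and the thrift server,
# and point at the CDH Hadoop version configured in the pom
mvn clean package -DskipTests=true \
    -Phive -Phive-thriftserver -Pyarn \
    -Dhadoop.version=2.6.0-cdh5.16.2
```

The -Phive-thriftserver profile is what pulls in the hive-thriftserver module whose pom was modified above.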

3. Import the source code into IDEA

The source code is a Maven project; after importing it into IDEA, wait for dependency resolution to complete.

Before building, delete the streaming package under the test directory of the spark-sql module; otherwise Build Project will compile it and trigger java.lang.OutOfMemoryError: GC overhead limit exceeded. Then click Build Project to compile.

After successful compilation, you can debug SparkSQL

4. Debug SparkSQL locally

Find the hive-thriftserver module, create a new resources directory under main, and mark it as a resources root.
Then copy hive-site.xml from the cluster into the resources directory:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
	<property>
		<name>hive.cli.print.header</name>
		<value>true</value>
	</property>
	<property>
		<name>hive.cli.print.current.db</name>
		<value>true</value>
	</property>
	<property>
		<name>hive.metastore.uris</name>
		<value>thrift://hadoop:9083</value>
		<description>Points to the host running the metastore service</description>
	</property>
</configuration>

Note: Only hive-site.xml is needed here

The server needs to start the metastore service

hive --service metastore &
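To confirm the metastore is up before debugging, you can check that its thrift port (9083, as configured in hive-site.xml above) is listening; the exact tool available depends on your server:

```shell
# Verify the Hive metastore is listening on its thrift port
netstat -tlnp | grep 9083
```

If nothing is listed, check the metastore log before attempting to connect from the IDE.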

Run SparkSQLCLIDriver

Before running, add the following VM options:

-Dspark.master=local[2] -Djline.WindowsTerminal.directConsole=false
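The full IDEA run configuration can be sketched as follows; the main class is the SparkSQLCLIDriver object in the hive-thriftserver module, and the module name shown is an assumption about how IDEA labels it in your import:

```
Main class: org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
VM options: -Dspark.master=local[2] -Djline.WindowsTerminal.directConsole=false
Module:     spark-hive-thriftserver
```

Running this configuration should drop you into the spark-sql prompt shown below.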
spark-sql (default)> show databases;
databaseName
company
default
hive_function_analyze
skewtest
Time taken: 0.028 seconds, Fetched 10 row(s)

select * from score;

INFO SparkSQLCLIDriver: Time taken: 1.188 seconds, Fetched 4 row(s)
id	name	subject
1	tom	["HuaXue","Physical","Math","Chinese"]
2	jack	["HuaXue","Animal","Computer","Java"]
3	john	["ZheXue","ZhengZhi","SiXiu","history"]
4	alice	["C++","Linux","Hadoop","Flink"]
spark-sql (default)> 

To sum up:

Download the Spark source code and compile it on the command line first; import it into IDEA only after the compilation succeeds, and then run Build Project. At this point ClassNotFoundException errors may be reported; run Generate Sources and change the provided scopes as described above to work through them one by one, and finally verify the build with a test query.


Origin: blog.csdn.net/qq_43081842/article/details/105777311