Getting hands-on with the Spark source code

This article walks through compiling the Spark source code in the IDEA editor and submitting jobs from it.

1. Download the Spark source code. In IDEA, click VCS -> Checkout from Version Control -> Git and clone the repository to a local directory:

        https://github.com/apache/spark
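If you prefer the command line, the equivalent clone (assuming Git is installed locally) is:

    git clone https://github.com/apache/spark.git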

2. To make local compilation faster, add the oschina Maven mirror repository to the parent pom.xml:

  

<repository>
    <id>nexus</id>
    <name>local private nexus</name>
    <url>http://maven.oschina.net/content/groups/public/</url>
    <releases>
        <enabled>true</enabled>
    </releases>
    <snapshots>
        <enabled>false</enabled>
    </snapshots>
</repository>
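Note that this entry has to sit inside the <repositories> element of the parent pom.xml. A sketch of the placement (not the full file; the repositories Spark already declares stay as they are):

    <repositories>
        <!-- repositories already declared by Spark stay here -->
        <repository>
            <id>nexus</id>
            <url>http://maven.oschina.net/content/groups/public/</url>
        </repository>
    </repositories>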

3. Run mvn clean idea:idea from the source root to generate the IDEA project files. Note that the local Maven installation should be version 3.3.x or later.
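From the root of the checked-out source tree, that is:

    mvn -version          # confirm Maven 3.3.x or newer
    mvn clean idea:idea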

4. Open IDEA, click File -> Open, locate the Spark source directory in the dialog, and open the project.

5. Find Master.scala (under org.apache.spark.deploy.master) and run it (right-click -> Run).

  

 Then check in a browser whether the master came up successfully. The master web UI listens on port 8080 by default (e.g. http://192.168.3.107:8080), and the page shows the master URL in the form spark://<host>:7077, which is needed in the next step.
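If the master does not bind to the address you expect, you can pass it explicit options via Program arguments in the run configuration. A minimal sketch, assuming the host IP 192.168.3.107 used later in this article:

    --host 192.168.3.107 --port 7077 --webui-port 8080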



 6. Next run the worker: find Worker.scala (under org.apache.spark.deploy.worker) and right-click to run it, but first some setup is needed: add the parameters --webui-port 8081 spark://192.168.3.107:7077 to Program arguments in the run configuration. (Note: localhost does not always work here, depending on the system's host name resolution, so it is safest to use the IP address.)
 

 Now refresh the master UI in the browser and you will see that the worker has registered.



 7. Now let's write a Spark program and submit it for execution. Create a new Scala Maven project that aggregates user information (see the attachment for the project download). The core code is as follows:

package com.zhanjun.spark.learn

import org.apache.spark.{SparkConf, SparkContext}

object UserInfoCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: UserInfoCount <file1> <file2>")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("UserInfoCount")
    val sc = new SparkContext(conf)
    // Read the input, split each line on ",", and keep only rows with exactly 8 fields
    val userInfoRDD = sc.textFile(args(0)).map(_.split(",")).filter(_.length == 8)
    // Count by region (the 4th field), then sort the totals in descending order
    val blockCountRDD = userInfoRDD.map(x => (x(3), 1)).reduceByKey(_ + _).map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1))
    // Count by the first three digits of the phone number (the 3rd field), sorted by prefix in ascending order
    val phoneCountRDD = userInfoRDD.map(x => (x(2).substring(0, 3), 1)).reduceByKey(_ + _).sortByKey(true)
    // Union the two result sets and write them to the output path (e.g. HDFS)
    val unionRDD = blockCountRDD.union(phoneCountRDD)
    // repartition(1) collapses the result into a single output partition
    unionRDD.repartition(1).saveAsTextFile(args(1))
    sc.stop()
  }
}
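For illustration, here is a made-up input fragment that matches the assumptions the code makes (8 comma-separated fields, with the phone number in the 3rd field and the region in the 4th; every value below is invented):

    1,0001,13812340001,beijing,f5,f6,f7,f8
    2,0002,13912340002,shanghai,f5,f6,f7,f8
    3,0003,13812340003,beijing,f5,f6,f7,f8

With this input, blockCountRDD produces (beijing,2) and (shanghai,1), phoneCountRDD produces (138,2) and (139,1), and the union of the two is what ends up in the output directory.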

 Package the project with mvn clean package, and extract users_txt.zip into the directory that will be passed as the input path (in the example below, /home/admin/temp).

 

8. Back in the IDEA project with the Spark source code, let's submit the job through IDEA: find org.apache.spark.deploy.SparkSubmit, right-click to run it, and set the corresponding Program arguments:

 

--class
com.zhanjun.spark.learn.UserInfoCount
--master
local
/home/admin/workspace/spark-work/UserInfoCount/target/UserInfoCount-1.0-SNAPSHOT.jar
file:///home/admin/temp/users.txt
file:///home/admin/temp/output/
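These are exactly the arguments you would pass to bin/spark-submit from a shell; for comparison, a sketch using the same paths:

    ./bin/spark-submit \
      --class com.zhanjun.spark.learn.UserInfoCount \
      --master local \
      /home/admin/workspace/spark-work/UserInfoCount/target/UserInfoCount-1.0-SNAPSHOT.jar \
      file:///home/admin/temp/users.txt \
      file:///home/admin/temp/output/

With --master local the job runs inside the driver process; to run it on the standalone master started above, replace local with spark://192.168.3.107:7077.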

We will find that the corresponding results have been written under the /home/admin/temp/output directory.
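saveAsTextFile writes a directory of part files; with repartition(1) the listing typically looks like this (exact names may vary slightly across Spark/Hadoop versions):

    $ ls /home/admin/temp/output/
    _SUCCESS  part-00000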

A note on the process: with the source tarball downloaded from the Spark official website, Akka initialization appeared to fail during SparkSubmit; after switching to the code cloned from Git, there was no problem.
