[Gandalf] Java Hello World on Spark

Introduction
A Hello World Spark application written in Java. It is a bit clunky and nowhere near as concise as the Scala version, but still worth trying out and writing down.

Environment
Windows 7 (development machine)
Eclipse + Maven
JDK 1.7
Ubuntu 14.04 (Spark cluster nodes)

Step 1: Create a Maven project in Eclipse. The process is straightforward, so it is not detailed here.
The pom file:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>edu.berkeley</groupId>
  <artifactId>SparkProj</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <name>Spark Project</name>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>
  </dependencies>
</project>
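
One optional addition that is not in the original pom: since the environment above uses JDK 1.7, you can pin the compiler level with maven-compiler-plugin so the build does not fall back to Maven's default source level. A minimal sketch, to be placed inside <project>:

<build>
  <plugins>
    <plugin>
      <!-- assumption: pin source/target to the JDK 1.7 listed above -->
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>1.7</source>
        <target>1.7</target>
      </configuration>
    </plugin>
  </plugins>
</build>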

Step 2: Write the core logic
The functionality is simple: count how many lines of README.md under the Spark root directory on the cluster contain the letter a, and how many contain the letter b.
Honestly, this is trivial to write in Scala; in Java it is painfully verbose.

package edu.berkeley.SparkProj;

/* SimpleApp.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "file:///home/fulong/Spark/spark-1.3.0-bin-hadoop2.4/README.md"; // Should be some file on your system
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Load the file as an RDD of lines; cache it because it is scanned twice below
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    // Count the lines containing "a"
    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    // Count the lines containing "b"
    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

    sc.stop(); // Release the SparkContext before exiting
  }
}
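
For reference, the verbosity mostly disappears on JDK 8: because org.apache.spark.api.java.function.Function is a single-method interface, the anonymous classes above can be written as lambdas. A minimal sketch, assuming you build with Java 8 rather than the JDK 1.7 listed in the environment:

    // Java 8 lambdas (assumption: JDK 8 build, not the JDK 1.7 used above)
    long numAs = logData.filter(s -> s.contains("a")).count();
    long numBs = logData.filter(s -> s.contains("b")).count();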

Step 3: In a Windows CMD prompt, change into the project root directory and package the project:
D:\WorkSpace2015\SparkProj>mvn package
This generates the jar:
D:\WorkSpace2015\SparkProj\target\SparkProj-1.0.jar

Step 4: Copy the jar to a directory on one of the cluster nodes using WinSCP:
/home/fulong/Workspace/Spark/SparkProj-1.0.jar
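
If you prefer the command line over WinSCP, a plain scp does the same job. A sketch, assuming an OpenSSH-style scp client is available on the Windows side and that FBI016 is the cluster host used in the final step:

D:\WorkSpace2015\SparkProj>scp target\SparkProj-1.0.jar fulong@FBI016:/home/fulong/Workspace/Spark/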

Final step: Submit the application to the Spark cluster via spark-submit:
fulong@FBI016:~/Spark/spark-1.3.0-bin-hadoop2.4$ ./bin/spark-submit --class edu.berkeley.SparkProj.SimpleApp --master yarn-client /home/fulong/Workspace/Spark/SparkProj-1.0.jar
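
If the YARN defaults do not fit your cluster, spark-submit also accepts resource flags. A hedged example; the values below are placeholder assumptions, not settings from the original run:

fulong@FBI016:~/Spark/spark-1.3.0-bin-hadoop2.4$ ./bin/spark-submit \
  --class edu.berkeley.SparkProj.SimpleApp \
  --master yarn-client \
  --num-executors 2 \
  --executor-memory 1g \
  --executor-cores 1 \
  /home/fulong/Workspace/Spark/SparkProj-1.0.jar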

The run reports 60 lines containing a and 29 lines containing b:
Lines with a: 60, lines with b: 29

Reposted from blog.csdn.net/u010967382/article/details/45100101