Started learning a bit of Spark. Here is my first small example, recorded for posterity ^_^
Background
I have a refund file like this (fields: refund type, order no., item id, order amount, refund amount):
```
仅退款,E20190201001,I001,0.01,0.01
退货退款,E20190201002,I002,0.01,0.01
退货退款,E20190201003,I003,1.2,1.2
退货退款,E20190201004,I004,10.9,10.9
仅退款,E20190201004,I005,10.9,10.9
仅退款,E20190201005,I006,2,1
仅退款,E20190201006,I007,0.18,0.05
```
I plan to process it with Spark.
pom file
Use the latest Spark version (2.4.0 at the time of writing), and opencsv to parse the CSV lines.
```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>2.4.0</version>
</dependency>
<dependency>
    <groupId>com.thoughtworks.paranamer</groupId>
    <artifactId>paranamer</artifactId>
    <version>2.8</version>
</dependency>
<dependency>
    <groupId>net.sf.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>2.3</version>
</dependency>
```
Code
BasicSpark.java
```java
package zzz.spark;

import au.com.bytecode.opencsv.CSVReader;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.IOException;
import java.io.StringReader;
import java.util.Objects;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

public class BasicSpark {

    public static void main(String[] args) {
        JavaSparkContext sc = buildSparkContext();

        // Read the file as raw bytes and decode each line as GBK (see P4 below).
        JavaRDD<String> rdd = sc.hadoopFile("/Users/shuqin/Downloads/refund.csv", TextInputFormat.class,
                LongWritable.class, Text.class)
                .map(pair -> new String(pair._2.getBytes(), 0, pair._2.getLength(), "GBK"));

        // Parse each line into a RefundInfo, dropping malformed rows (from() returns null for them).
        JavaRDD<RefundInfo> refundInfos = rdd.map(BasicSpark::parseLine)
                .map(RefundInfo::from)
                .filter(Objects::nonNull);
        System.out.println("refund info number: " + refundInfos.count());

        // Filter: orders whose real price is at least 10.
        JavaRDD<RefundInfo> filtered = refundInfos.filter(refundInfo -> refundInfo.getRealPrice() >= 10);
        System.out.println("realPrice >= 10: " + filtered.collect().stream()
                .map(RefundInfo::getOrderNo).collect(Collectors.joining(", ")));

        // Group by refund type, then aggregate per group.
        JavaPairRDD<String, Iterable<RefundInfo>> grouped = refundInfos.groupBy(RefundInfo::getType);
        JavaPairRDD<String, Double> groupedRealPaySumRDD = grouped.mapValues(
                info -> StreamSupport.stream(info.spliterator(), false)
                        .mapToDouble(RefundInfo::getRealPrice).sum());
        System.out.println("groupedRealPaySum: " + groupedRealPaySumRDD.collectAsMap());

        JavaPairRDD<String, Long> groupedNumberRDD = grouped.mapValues(
                info -> StreamSupport.stream(info.spliterator(), false).count());
        System.out.println("groupedNumber: " + groupedNumberRDD.collectAsMap());
    }

    public static JavaSparkContext buildSparkContext() {
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("learningSparkInJava")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        return new JavaSparkContext(sparkConf);
    }

    public static String[] parseLine(String line) {
        try {
            CSVReader reader = new CSVReader(new StringReader(line));
            return reader.readNext();
        } catch (IOException e) {
            return new String[0];
        }
    }
}
```
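Before wiring up Spark, it can help to verify the expected aggregates with plain Java streams on the sample rows. This stdlib-only sketch (the class name GroupBySketch is mine, for illustration) mirrors the groupBy/mapValues logic above:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupBySketch {
    public static void main(String[] args) {
        // Each row: type, orderNo, goodsTitle, realPrice, refund — the sample file above.
        List<String[]> rows = Arrays.asList(
                new String[]{"仅退款", "E20190201001", "I001", "0.01", "0.01"},
                new String[]{"退货退款", "E20190201002", "I002", "0.01", "0.01"},
                new String[]{"退货退款", "E20190201003", "I003", "1.2", "1.2"},
                new String[]{"退货退款", "E20190201004", "I004", "10.9", "10.9"},
                new String[]{"仅退款", "E20190201004", "I005", "10.9", "10.9"},
                new String[]{"仅退款", "E20190201005", "I006", "2", "1"},
                new String[]{"仅退款", "E20190201006", "I007", "0.18", "0.05"});

        // Equivalent of groupBy(type) + mapValues(sum of realPrice).
        Map<String, Double> sums = rows.stream().collect(
                Collectors.groupingBy(r -> r[0],
                        Collectors.summingDouble(r -> Double.parseDouble(r[3]))));
        System.out.println("groupedRealPaySum: " + sums);

        // Equivalent of groupBy(type) + mapValues(count).
        Map<String, Long> counts = rows.stream().collect(
                Collectors.groupingBy(r -> r[0], Collectors.counting()));
        System.out.println("groupedNumber: " + counts);
    }
}
```

On this data the sums come out to roughly 13.09 for 仅退款 and 12.11 for 退货退款, with 4 and 3 rows respectively, which is what the Spark job should print as well.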
RefundInfo.java
```java
package zzz.spark;

import lombok.Data;

@Data
public class RefundInfo {

    private String type;       // refund type
    private String orderNo;    // order number
    private String goodsTitle; // goods title
    private Double realPrice;  // order amount
    private Double refund;     // refund amount

    public static RefundInfo from(String[] arr) {
        if (arr == null || arr.length != 5) {
            return null;
        }
        RefundInfo refundInfo = new RefundInfo();
        refundInfo.setType(arr[0]);
        refundInfo.setOrderNo(arr[1]);
        refundInfo.setGoodsTitle(arr[2]);
        refundInfo.setRealPrice(Double.valueOf(arr[3]));
        refundInfo.setRefund(Double.valueOf(arr[4]));
        return refundInfo;
    }
}
```
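As a quick sanity check of the five-field contract that RefundInfo.from expects: for this particular file a plain comma split already works, since no field contains embedded commas or quotes (opencsv remains the safer choice in general). The class name ParseSketch below is just for illustration:

```java
public class ParseSketch {
    public static void main(String[] args) {
        // A sample line from refund.csv; a plain split suffices here because
        // none of the fields contain embedded commas or quotes.
        String line = "仅退款,E20190201001,I001,0.01,0.01";
        String[] arr = line.split(",");
        System.out.println(arr.length);                 // 5 fields, as RefundInfo.from expects
        System.out.println(Double.parseDouble(arr[3])); // realPrice as a double
    }
}
```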
Explanation
Spark has two kinds of operations: transformations, which turn one RDD into another and are lazy (nothing runs yet); and actions, which actually compute a result from an RDD (e.g. count, collect).
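This laziness can be felt even without Spark: Java stream pipelines behave the same way, with intermediate operations deferred until a terminal operation runs. A stdlib-only sketch (the class name LazinessSketch is mine):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazinessSketch {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();

        // Like an RDD transformation, map() here is lazy: the lambda has not run yet.
        Stream<Integer> mapped = Stream.of(1, 2, 3)
                .map(x -> { calls.incrementAndGet(); return x * 2; });
        System.out.println("after map: " + calls.get());     // 0 — nothing executed

        // Like an RDD action, collect() forces the whole pipeline to run.
        List<Integer> result = mapped.collect(Collectors.toList());
        System.out.println("after collect: " + calls.get()); // 3 — one call per element
        System.out.println(result);                          // [2, 4, 6]
    }
}
```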
The Java API likewise has two flavors of RDD: list-like ones (JavaRDD) and key-value ones (JavaPairRDD).
The full code is above; readers with some Java stream experience should have no trouble following it.
Troubleshooting
P1. Exception in thread "main" java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V
Solution: upgrade scala-library from 2.11.8 to 2.12.0-RC2 (spark-core_2.12 is built against Scala 2.12).
P2. Exception in thread "main" java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.
Solution: downgrade guava from 23.0 to 15.0.
P3. object not serializable.
Solution: configure Kryo serialization. Note the setting goes on the SparkConf, not on the JavaSparkContext:
new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
P4. Handling Chinese text.
The file is GBK-encoded, so sc.textFile(path), which decodes lines as UTF-8, garbles it. Read the raw bytes and decode them as GBK instead:
sc.hadoopFile(path, TextInputFormat.class, LongWritable.class, Text.class).map(pair -> new String(pair._2.getBytes(), 0, pair._2.getLength(), "GBK"));
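The decoding step itself can be verified in isolation, assuming the JVM ships the GBK charset (standard JDK builds do). GbkSketch is an illustrative name:

```java
import java.io.UnsupportedEncodingException;

public class GbkSketch {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Simulate what hadoopFile + Text hands us: raw bytes of a GBK-encoded line.
        String original = "仅退款,E20190201001,I001,0.01,0.01";
        byte[] gbkBytes = original.getBytes("GBK");

        // Decoding explicitly as GBK recovers the original text;
        // decoding those bytes as UTF-8 would garble the Chinese type field.
        String decoded = new String(gbkBytes, 0, gbkBytes.length, "GBK");
        System.out.println(decoded.equals(original)); // true
    }
}
```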
[To be continued]