I. Overview
1. Requirements analysis
Data format: date, user, search term, city, platform, version (one record per line, fields separated by colons).
Requirements:
1. Filter the data, keeping only the records that satisfy the query conditions (city, platform, version).
2. For each day, find the top 3 search terms ranked by search UV (unique visitors).
3. Sort the days in descending order by the total search UV of each day's top 3 search terms.
4. Save the result to a Hive table.
### Data: keyword.txt
2018-10-1:leo:water:beijing:android:1.0
2018-10-1:leo1:water:beijing:android:1.0
2018-10-1:leo2:water:beijing:android:1.0
2018-10-1:jack:water:beijing:android:1.0
2018-10-1:jack1:water:beijing:android:1.0
2018-10-1:leo:seafood:beijing:android:1.0
2018-10-1:leo1:seafood:beijing:android:1.0
2018-10-1:leo2:seafood:beijing:android:1.0
2018-10-1:leo:food:beijing:android:1.0
2018-10-1:leo1:food:beijing:android:1.0
2018-10-1:leo2:meat:beijing:android:1.0
2018-10-2:leo:water:beijing:android:1.0
2018-10-2:leo1:water:beijing:android:1.0
2018-10-2:leo2:water:beijing:android:1.0
2018-10-2:jack:water:beijing:android:1.0
2018-10-2:leo1:seafood:beijing:android:1.0
2018-10-2:leo2:seafood:beijing:android:1.0
2018-10-2:leo3:seafood:beijing:android:1.0
2018-10-2:leo1:food:beijing:android:1.0
2018-10-2:leo2:food:beijing:android:1.0
2018-10-2:leo:meat:beijing:android:1.0
#### Notes
1. If you create the file with a plain-text editor, save it in ANSI format. Otherwise the first line picks up an invisible leading character, its key no longer matches the other records, and groupByKey fails to group it correctly.
2. The file must not end with blank lines. Otherwise split returns too few fields and the job fails with an array index out of bounds error.
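Both pitfalls above can also be handled in code rather than by cleaning the file by hand. The following is a minimal sketch, not part of the original program: the class `KeywordLogParser` and its methods are hypothetical names, and it assumes the invisible leading character is a byte order mark (U+FEFF), which is only a guess at the cause.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not in the original code): tolerant parsing of keyword.txt lines.
final class KeywordLogParser {
    // date, user, search term, city, platform, version
    static final int FIELD_COUNT = 6;

    /** Returns the six fields of a record, or null if the line should be skipped. */
    static String[] parseLine(String line) {
        if (line == null) {
            return null;
        }
        // Assumption: the invisible leading character is a byte order mark.
        if (!line.isEmpty() && line.charAt(0) == '\uFEFF') {
            line = line.substring(1);
        }
        line = line.trim();
        if (line.isEmpty()) {
            return null; // blank line, for example at the end of the file
        }
        String[] fields = line.split(":");
        // Reject malformed records instead of failing later with an index error.
        return fields.length == FIELD_COUNT ? fields : null;
    }

    /** Parses all lines and silently drops the ones that cannot be used. */
    static List<String[]> parseAll(List<String> lines) {
        List<String[]> records = new ArrayList<>();
        for (String line : lines) {
            String[] fields = parseLine(line);
            if (fields != null) {
                records.add(fields);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("\uFEFF2018-10-1:leo:water:beijing:android:1.0");
        lines.add("");
        System.out.println(parseAll(lines).size() + " usable record(s)");
    }
}
```

In a Spark job, the same check could run inside the first filter function, so that unusable lines are dropped before any field is read by index.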
2. Approach
1. From the original data (an HDFS file), obtain the input RDD.
2. Apply the filter operator to the input RDD, keeping only the data that satisfies the query conditions.
2.1 Common practice: reference the external query-condition Map directly inside the filter function. The drawback is that a copy of the Map is sent with every task, which hurts performance.
2.2 Optimized practice: wrap the query conditions in a Broadcast variable and read the broadcast data inside the filter operator.
3. Convert the data to "(date_searchterm, user)" format and group it by key. Then map again to deduplicate the users who searched each term on each day. The number of users left after deduplication is that search term's UV for the day, which gives "(date_searchterm, uv)".
4. Map the RDD of daily UV per search term to an RDD whose elements are of type Row, and convert that RDD to a DataFrame.
5. Register the DataFrame as a temporary table and use a Spark SQL window function to find, for each day, the 3 search terms with the highest UV together with their UV. The result is a DataFrame.
6. Convert the DataFrame back to an RDD, group it by date, and map it to compute the total UV of each day's top 3 search terms. Use the total UV as the key, and as the value a string that concatenates the date with the top 3 search terms and their UV.
7. Sort by the daily top 3 total UV in descending order.
8. Map the sorted data back to the "date_searchterm_uv" format.
9. Map it to a DataFrame again and save the data to Hive.
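To make the core of these steps concrete, here is a minimal sketch in plain Java collections, without Spark. It covers only the UV calculation (step 3), the per-day top 3 (step 5), and the ordering of days by top 3 total UV (steps 6 and 7). The class `DailyTop3Sketch` and its method names are hypothetical and chosen for illustration. Breaking ties by keyword or date is also an added assumption, since the original leaves tie order unspecified.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical, Spark-free illustration of steps 3 and 5 to 7.
final class DailyTop3Sketch {

    /** Step 3: date -> (keyword -> number of distinct users). */
    static Map<String, Map<String, Long>> uvByDateAndKeyword(List<String> lines) {
        Map<String, Map<String, Set<String>>> users = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.split(":"); // date:user:keyword:city:platform:version
            users.computeIfAbsent(f[0], d -> new TreeMap<>())
                 .computeIfAbsent(f[2], k -> new HashSet<>())
                 .add(f[1]); // the set removes duplicate users
        }
        Map<String, Map<String, Long>> uv = new TreeMap<>();
        users.forEach((date, byKeyword) -> {
            Map<String, Long> counts = new TreeMap<>();
            byKeyword.forEach((keyword, set) -> counts.put(keyword, (long) set.size()));
            uv.put(date, counts);
        });
        return uv;
    }

    /** Step 5: the n keywords with the highest UV (ties broken by keyword). */
    static List<Map.Entry<String, Long>> topN(Map<String, Long> uvByKeyword, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(uvByKeyword.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()));
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    /** Steps 6 and 7: dates ordered by the total UV of their top 3, descending. */
    static List<String> datesByTop3Total(Map<String, Map<String, Long>> uv) {
        Map<String, Long> totals = new HashMap<>();
        uv.forEach((date, counts) -> totals.put(date,
                topN(counts, 3).stream().mapToLong(Map.Entry::getValue).sum()));
        List<String> dates = new ArrayList<>(totals.keySet());
        dates.sort(Comparator.comparing((String d) -> totals.get(d)).reversed()
                .thenComparing(Comparator.<String>naturalOrder()));
        return dates;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("2018-10-1:leo:water:beijing:android:1.0");
        lines.add("2018-10-1:jack:water:beijing:android:1.0");
        lines.add("2018-10-2:leo:food:beijing:android:1.0");
        Map<String, Map<String, Long>> uv = uvByDateAndKeyword(lines);
        for (String date : datesByTop3Total(uv)) {
            System.out.println(date + " " + topN(uv.get(date), 3));
        }
    }
}
```

The Spark version reaches the same result with groupByKey for the deduplication, a ROW_NUMBER window function for the top 3, and sortByKey for the final ordering.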
II. Java implementation
package cn.spark.study.sql;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import scala.Tuple2;

// Daily top 3 search terms by UV. Written against the Spark 1.x Java API
// (DataFrame, HiveContext, registerTempTable, saveAsTable).
public class DailyTop3Keyword {

    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("DailyTop3Keyword");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqlContext = new HiveContext(jsc.sc());

        // Mock query conditions (in practice they might come from a MySQL database)
        final HashMap<String, List<String>> queryParaMap = new HashMap<String, List<String>>();
        queryParaMap.put("city", Arrays.asList("beijing"));
        queryParaMap.put("platform", Arrays.asList("android"));
        queryParaMap.put("version", Arrays.asList("1.0", "1.2", "2.0", "1.5"));

        // Broadcast the query conditions so each executor holds a single copy
        final Broadcast<HashMap<String, List<String>>> queryParamMapBroadcast =
                jsc.broadcast(queryParaMap);

        // Read the log file from HDFS to obtain the input RDD
        JavaRDD<String> rowRDD = jsc.textFile("hdfs://spark1:9000/spark-study/keyword.txt");

        // Filter with the filter operator
        JavaRDD<String> filterRDD = rowRDD.filter(new Function<String, Boolean>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Boolean call(String log) throws Exception {
                // Split the raw log line to get city, platform and version
                String[] logSplit = log.split(":");
                String city = logSplit[3];
                String platform = logSplit[4];
                String version = logSplit[5];

                // Compare with the query conditions: if a condition is set and the
                // log line does not satisfy it, return false and drop the line.
                // If every condition that is set is satisfied, return true and keep it.
                HashMap<String, List<String>> queryParamMap = queryParamMapBroadcast.value();

                List<String> cities = queryParamMap.get("city");
                if (cities.size() > 0 && !cities.contains(city)) {
                    return false;
                }
                List<String> platforms = queryParamMap.get("platform");
                if (platforms.size() > 0 && !platforms.contains(platform)) {
                    return false;
                }
                List<String> versions = queryParamMap.get("version");
                if (versions.size() > 0 && !versions.contains(version)) {
                    return false;
                }
                return true;
            }
        });

        // Map the filtered log lines to (date_searchterm, user)
        JavaPairRDD<String, String> dateKeyWordUserRDD = filterRDD.mapToPair(
                new PairFunction<String, String, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, String> call(String log) throws Exception {
                String[] logSplit = log.split(":");
                String date = logSplit[0];
                String user = logSplit[1];
                String keyword = logSplit[2];
                return new Tuple2<String, String>(date + "_" + keyword, user);
            }
        });

        // Group: for each search term on each day, the users who searched it (not yet deduplicated)
        JavaPairRDD<String, Iterable<String>> dateKeywordUsersRDD = dateKeyWordUserRDD.groupByKey();
        List<Tuple2<String, Iterable<String>>> collect1 = dateKeywordUsersRDD.collect();
        for (Tuple2<String, Iterable<String>> tuple2 : collect1) {
            System.out.println("Users per search term per day (not deduplicated): " + tuple2);
        }

        // Deduplicate the users of each search term on each day to get the UV
        JavaPairRDD<String, Long> dateKeywordUvRDD = dateKeywordUsersRDD.mapToPair(
                new PairFunction<Tuple2<String, Iterable<String>>, String, Long>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, Long> call(Tuple2<String, Iterable<String>> dataKeywordUsers)
                    throws Exception {
                String dateKeyword = dataKeywordUsers._1;
                Iterator<String> users = dataKeywordUsers._2.iterator();
                // Deduplicate the users and count them
                Set<String> distinctUsers = new HashSet<String>();
                while (users.hasNext()) {
                    distinctUsers.add(users.next());
                }
                // The UV is the number of distinct users
                long uv = distinctUsers.size();
                // (date_searchterm, number of users)
                return new Tuple2<String, Long>(dateKeyword, uv);
            }
        });
        List<Tuple2<String, Long>> collect2 = dateKeywordUvRDD.collect();
        for (Tuple2<String, Long> stringLongTuple2 : collect2) {
            System.out.println("UV per search term per day: " + stringLongTuple2);
        }

        // Convert the daily UV of each search term into a DataFrame
        JavaRDD<Row> dateKeywordUvRowRDD = dateKeywordUvRDD.map(
                new Function<Tuple2<String, Long>, Row>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Row call(Tuple2<String, Long> dateKeywordUv) throws Exception {
                String date = dateKeywordUv._1.split("_")[0];
                String keyword = dateKeywordUv._1.split("_")[1];
                long uv = dateKeywordUv._2;
                return RowFactory.create(date, keyword, uv);
            }
        });
        ArrayList<StructField> fields = new ArrayList<StructField>();
        fields.add(DataTypes.createStructField("date", DataTypes.StringType, true));
        fields.add(DataTypes.createStructField("keyword", DataTypes.StringType, true));
        fields.add(DataTypes.createStructField("uv", DataTypes.LongType, true));
        StructType structType = DataTypes.createStructType(fields);
        DataFrame dateKeywordUvDF = sqlContext.createDataFrame(dateKeywordUvRowRDD, structType);
        dateKeywordUvDF.registerTempTable("sales");

        // Use a window function to find the 3 search terms with the highest UV for each day
        final DataFrame dailyTop3KeyWordDF = sqlContext.sql(
                "SELECT date, keyword, uv "
                + "FROM ("
                + "SELECT date, keyword, uv, "
                + "ROW_NUMBER() OVER (PARTITION BY date ORDER BY uv DESC) rank "
                + "FROM sales"
                + ") tmp_sales "
                + "WHERE rank <= 3");

        // Convert the DataFrame to an RDD and map it to (date, searchterm_uv)
        JavaRDD<Row> dailyTop3KeyWordRDD = dailyTop3KeyWordDF.javaRDD();
        JavaPairRDD<String, String> dailyTop3KeywordRDD = dailyTop3KeyWordRDD.mapToPair(
                new PairFunction<Row, String, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, String> call(Row row) throws Exception {
                String date = String.valueOf(row.get(0));
                String keyword = String.valueOf(row.get(1));
                String uv = String.valueOf(row.get(2));
                return new Tuple2<String, String>(date, keyword + "_" + uv);
            }
        });
        List<Tuple2<String, String>> collect = dailyTop3KeywordRDD.collect();
        for (Tuple2<String, String> stringStringTuple2 : collect) {
            System.out.println("Window function result: " + stringStringTuple2);
        }

        // Group by date
        JavaPairRDD<String, Iterable<String>> top3DateKeywordsRDD = dailyTop3KeywordRDD.groupByKey();

        // Map to (total UV of the day's top 3, "date,searchterm_uv,searchterm_uv,...")
        JavaPairRDD<Long, String> uvDateKeywordsRDD = top3DateKeywordsRDD.mapToPair(
                new PairFunction<Tuple2<String, Iterable<String>>, Long, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<Long, String> call(Tuple2<String, Iterable<String>> tuple)
                    throws Exception {
                String date = tuple._1;
                // Collection of searchterm_uv strings
                Iterator<String> keywordUvIterator = tuple._2.iterator();
                long totalUv = 0L;
                String dateKeyword = date;
                while (keywordUvIterator.hasNext()) {
                    // searchterm_uv
                    String keywordUv = keywordUvIterator.next();
                    Long uv = Long.valueOf(keywordUv.split("_")[1]);
                    totalUv += uv;
                    dateKeyword = dateKeyword + "," + keywordUv;
                }
                return new Tuple2<Long, String>(totalUv, dateKeyword);
            }
        });

        // Sort by total UV, descending
        JavaPairRDD<Long, String> sortedUvDateKeywordsRDD = uvDateKeywordsRDD.sortByKey(false);
        List<Tuple2<Long, String>> rows = sortedUvDateKeywordsRDD.collect();
        for (Tuple2<Long, String> row : rows) {
            System.out.println(row._2 + " " + row._1);
        }

        // Map back to one Row per (date, search term, uv)
        JavaRDD<Row> resultRDD = sortedUvDateKeywordsRDD.flatMap(
                new FlatMapFunction<Tuple2<Long, String>, Row>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Iterable<Row> call(Tuple2<Long, String> tuple) throws Exception {
                String[] dateKeywordsSplit = tuple._2.split(",");
                String date = dateKeywordsSplit[0];
                List<Row> rows = new ArrayList<Row>();
                // Loop instead of hard-coding indexes 1 to 3, so that a day with
                // fewer than 3 search terms does not cause an index error
                for (int i = 1; i < dateKeywordsSplit.length; i++) {
                    String[] keywordUv = dateKeywordsSplit[i].split("_");
                    rows.add(RowFactory.create(date, keywordUv[0], Long.valueOf(keywordUv[1])));
                }
                return rows;
            }
        });

        // Convert the final data to a DataFrame and save it to a Hive table
        DataFrame finalDF = sqlContext.createDataFrame(resultRDD, structType);
        // List<Row> rows1 = finalDF.javaRDD().collect();
        // for (Row row : rows1) {
        //     System.out.println(row);
        // }
        finalDF.saveAsTable("daily_top3_keyword_uv");

        jsc.close();
    }
}