48. Integrating Spark SQL with Spark Core: a real-world case of daily top-3 hot search word statistics

I. Overview

1. Requirements analysis

Data format (one record per line, colon-separated):
date:user:keyword:city:platform:version

Requirements:
 1. Keep only the records that satisfy the query conditions (city, platform, version).
 2. For each day, compute the top 3 search keywords ranked by search UV (the number of distinct users who searched that keyword that day).
 3. Sort the days in descending order by the total UV of their top-3 keywords.
 4. Save the result to a Hive table.





### Data file: keyword.txt

2018-10-1:leo:water:beijing:android:1.0
2018-10-1:leo1:water:beijing:android:1.0
2018-10-1:leo2:water:beijing:android:1.0
2018-10-1:jack:water:beijing:android:1.0
2018-10-1:jack1:water:beijing:android:1.0
2018-10-1:leo:seafood:beijing:android:1.0
2018-10-1:leo1:seafood:beijing:android:1.0
2018-10-1:leo2:seafood:beijing:android:1.0
2018-10-1:leo:food:beijing:android:1.0
2018-10-1:leo1:food:beijing:android:1.0
2018-10-1:leo2:meat:beijing:android:1.0
2018-10-2:leo:water:beijing:android:1.0
2018-10-2:leo1:water:beijing:android:1.0
2018-10-2:leo2:water:beijing:android:1.0
2018-10-2:jack:water:beijing:android:1.0
2018-10-2:leo1:seafood:beijing:android:1.0
2018-10-2:leo2:seafood:beijing:android:1.0
2018-10-2:leo3:seafood:beijing:android:1.0
2018-10-2:leo1:food:beijing:android:1.0
2018-10-2:leo2:food:beijing:android:1.0
2018-10-2:leo:meat:beijing:android:1.0


#### Notes on the data file

1. Save keyword.txt as plain ANSI text. If the editor saves it in another encoding (for example UTF-8 with a BOM), the first line picks up an invisible leading character, so that record lands in the wrong group when groupByKey runs.

2. The file must not contain blank lines (especially trailing ones); otherwise split(":") produces too few fields and indexing the result throws an ArrayIndexOutOfBoundsException.
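If you cannot guarantee how the file was saved, a defensive pre-filter at the head of the pipeline guards against both problems. A minimal sketch, not part of the original job; it assumes an input JavaRDD<String> named rowRDD like the one created in the code below:

// Illustrative pre-filter: strip a UTF-8 BOM and drop blank or malformed lines,
// so that the later groupByKey and split() calls never see bad records.
JavaRDD<String> cleanedRDD = rowRDD
        .map(new Function<String, String>() {
            @Override
            public String call(String line) throws Exception {
                // "\uFEFF" is the BOM character some editors prepend to the first line
                return line.startsWith("\uFEFF") ? line.substring(1) : line;
            }
        })
        .filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String line) throws Exception {
                // keep only non-blank lines with exactly six ":"-separated fields
                return line.trim().length() > 0 && line.split(":").length == 6;
            }
        });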


2. Approach

1. Read the raw data (an HDFS file) and obtain the input RDD.

2. Use the filter operator to keep only the records that satisfy the query conditions.
   2.1 Common practice: reference the external query-condition Map directly inside the filter function. This ships a copy of the Map with every task, which performs poorly.
   2.2 Optimized practice: wrap the query conditions in a Broadcast variable and read it inside the filter operator; the broadcast value is shipped to each executor only once.

3. Convert the data to "(date_keyword, user)" format, group it by key, then map again: for each day's keyword, deduplicate the users and count them. The distinct count is that keyword's UV for the day, giving "(date_keyword, uv)".

4. Map the per-day keyword UV RDD to an RDD of Row elements, then convert that RDD to a DataFrame.

5. Register the DataFrame as a temporary table and use a Spark SQL window function to pick, for each day, the 3 keywords with the highest UV, together with their UVs. The result is again a DataFrame.

6. Convert that DataFrame back to an RDD, group it by day, and map each group: sum the UVs of the day's top-3 keywords, then emit the total as the key and the date concatenated with the top-3 "keyword_uv" strings as the value.

7. Sort by the daily top-3 total UV, in descending order.

8. Map the sorted data back into individual "date_keyword_uv" rows.

9. Map the rows to a DataFrame once more and save the data to Hive. (The whole pipeline is worked through on the sample data after this list.)
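Working the pipeline through the sample keyword.txt by hand (the mock query conditions in the code match every record, so nothing is filtered out):

2018-10-1: water uv=5, seafood uv=3, food uv=2, meat uv=1; top 3 = water, seafood, food; total uv = 5+3+2 = 10
2018-10-2: water uv=4, seafood uv=3, food uv=2, meat uv=1; top 3 = water, seafood, food; total uv = 4+3+2 = 9

Since 10 > 9, the final table should list the three 2018-10-1 rows before the three 2018-10-2 rows.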


II. Java implementation

package cn.spark.study.sql;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import scala.Tuple2;

import java.util.*;

public class DailyTop3Keyword {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("DailyTop3Keyword");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqlContext = new HiveContext(jsc.sc());

        // Mock query conditions (in practice these might come from a MySQL database)
        final HashMap<String, List<String>> queryParamMap = new HashMap<String, List<String>>();
        queryParamMap.put("city", Arrays.asList("beijing"));
        queryParamMap.put("platform", Arrays.asList("android"));
        queryParamMap.put("version", Arrays.asList("1.0", "1.2", "2.0", "1.5"));

        // Broadcast the query conditions to the executors
        final Broadcast<HashMap<String, List<String>>> queryParamMapBroadcast = jsc.broadcast(queryParamMap);
        
        // Read the log file from HDFS and obtain the input RDD
        JavaRDD<String> rowRDD = jsc.textFile("hdfs://spark1:9000/spark-study/keyword.txt");

        // Filter with the query conditions
        JavaRDD<String> filterRDD = rowRDD.filter(new Function<String, Boolean>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Boolean call(String log) throws Exception {
                // split the raw log line and pull out city, platform and version
                String[] logSplit = log.split(":");
                String city = logSplit[3];
                String platform = logSplit[4];
                String version = logSplit[5];

                // compare against the query conditions: as soon as any configured
                // condition is not satisfied by this record, return false and the
                // record is filtered out; if every condition passes, return true
                // and the record is kept
                HashMap<String, List<String>> queryParams = queryParamMapBroadcast.value();
                List<String> cities = queryParams.get("city");
                if (cities.size() > 0 && !cities.contains(city)) {
                    return false;
                }
                List<String> platforms = queryParams.get("platform");
                if (!platforms.contains(platform)) {
                    return false;
                }
                List<String> versions = queryParams.get("version");
                if (!versions.contains(version)) {
                    return false;
                }

                return true;
            }
        });
        
        // Map each filtered log line to (date_keyword, user) format
        JavaPairRDD<String, String> dateKeywordUserRDD = filterRDD.mapToPair(new PairFunction<String, String, String>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, String> call(String log) throws Exception {
                String[] logSplit = log.split(":");
                String date = logSplit[0];
                String user = logSplit[1];
                String keyword = logSplit[2];
                return new Tuple2<String, String>(date + "_" + keyword, user);
            }
        }); 
        
        // Group by key: for each day's keyword, collect the users who searched it (not yet deduplicated)
        JavaPairRDD<String, Iterable<String>> dateKeywordUsersRDD = dateKeywordUserRDD.groupByKey();
        List<Tuple2<String, Iterable<String>>> collect1 = dateKeywordUsersRDD.collect();
        for (Tuple2<String, Iterable<String>> tuple2 : collect1) {
            System.out.println("after groupByKey (users not deduplicated): " + tuple2);
        }

        // For each day's keyword, deduplicate its users; the distinct count is the keyword's UV
        JavaPairRDD<String, Long> dateKeywordUvRDD = dateKeywordUsersRDD.mapToPair(
                new PairFunction<Tuple2<String, Iterable<String>>, String, Long>() {

                    private static final long serialVersionUID = 1L;

                    @Override
                    public Tuple2<String, Long> call(Tuple2<String, Iterable<String>> dateKeywordUsers) throws Exception {
                        String dateKeyword = dateKeywordUsers._1;
                        Iterator<String> users = dateKeywordUsers._2.iterator();
                        // deduplicate the users and count the distinct ones
                        // (a HashSet would avoid the O(n^2) contains() scans on large groups)
                        List<String> distinctUsers = new ArrayList<String>();
                        while (users.hasNext()) {
                            String user = users.next();
                            if (!distinctUsers.contains(user)) {
                                distinctUsers.add(user);
                            }
                        }
                        // the distinct user count is the UV
                        long uv = distinctUsers.size();
                        // (date_keyword, uv)
                        return new Tuple2<String, Long>(dateKeyword, uv);
                    }
                });
        List<Tuple2<String, Long>> collect2 = dateKeywordUvRDD.collect();
        for (Tuple2<String, Long> tuple2 : collect2) {
            System.out.println("per-day keyword UV after user deduplication: " + tuple2);
        }


        // Convert the per-day keyword UV data into an RDD of Rows
        JavaRDD<Row> dateKeywordUvRowRDD = dateKeywordUvRDD.map(new Function<Tuple2<String, Long>, Row>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Row call(Tuple2<String, Long> dateKeywordUv) throws Exception {
                String date = dateKeywordUv._1.split("_")[0];
                String keyword = dateKeywordUv._1.split("_")[1];
                long uv = dateKeywordUv._2;
                return RowFactory.create(date, keyword, uv);
            }
        });
        ArrayList<StructField> fields = new ArrayList<StructField>();
        fields.add(DataTypes.createStructField("date", DataTypes.StringType, true));
        fields.add(DataTypes.createStructField("keyword", DataTypes.StringType, true));
        fields.add(DataTypes.createStructField("uv", DataTypes.LongType, true));
        StructType structType = DataTypes.createStructType(fields);
        DataFrame dateKeywordUvDF = sqlContext.createDataFrame(dateKeywordUvRowRDD, structType);
        dateKeywordUvDF.registerTempTable("sales");

        // Use a window function to pick, for each day, the top-3 keywords by search UV
        DataFrame dailyTop3KeywordDF = sqlContext.sql(
                "SELECT date, keyword, uv "
                + "FROM (SELECT date, keyword, uv, "
                + "ROW_NUMBER() OVER (PARTITION BY date ORDER BY uv DESC) rank "
                + "FROM sales) tmp_sales "
                + "WHERE rank <= 3");
        // Convert the DataFrame back to an RDD and map each Row to (date, keyword_uv)
        JavaRDD<Row> dailyTop3KeywordRowRDD = dailyTop3KeywordDF.javaRDD();

        JavaPairRDD<String, String> dailyTop3KeywordRDD = dailyTop3KeywordRowRDD.mapToPair(new PairFunction<Row, String, String>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, String> call(Row row) throws Exception {
                String date = String.valueOf(row.get(0));
                String keyword = String.valueOf(row.get(1));
                String uv = String.valueOf(row.get(2));
                // map to (date, keyword_uv)
                return new Tuple2<String, String>(date, keyword + "_" + uv);
            }
        });

        List<Tuple2<String, String>> collect = dailyTop3KeywordRDD.collect();
        for (Tuple2<String, String> tuple2 : collect) {
            System.out.println("after the window function: " + tuple2);
        }


        // Group by date
        JavaPairRDD<String, Iterable<String>> top3DateKeywordsRDD = dailyTop3KeywordRDD.groupByKey();
        // For each day, sum the top-3 UVs: key = total UV, value = "date,keyword_uv,keyword_uv,keyword_uv"
        JavaPairRDD<Long, String> uvDateKeywordsRDD = top3DateKeywordsRDD.mapToPair(new PairFunction<Tuple2<String, Iterable<String>>, Long, String>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<Long, String> call(Tuple2<String, Iterable<String>> tuple) throws Exception {
                String date = tuple._1;
                // iterate over this day's "keyword_uv" strings
                Iterator<String> keywordUvIterator = tuple._2.iterator();
                long totalUv = 0L;
                String dateKeywords = date;
                while (keywordUvIterator.hasNext()) {
                    // a single "keyword_uv" entry
                    String keywordUv = keywordUvIterator.next();
                    long uv = Long.valueOf(keywordUv.split("_")[1]);
                    totalUv += uv;
                    dateKeywords = dateKeywords + "," + keywordUv;
                }

                return new Tuple2<Long, String>(totalUv, dateKeywords);
            }
        });
        // Sort by total UV, descending
        JavaPairRDD<Long, String> sortedUvDateKeywordsRDD = uvDateKeywordsRDD.sortByKey(false);
        List<Tuple2<Long, String>> rows = sortedUvDateKeywordsRDD.collect();
        for (Tuple2<Long, String> row : rows) {
            System.out.println(row._2 + "    " + row._1);
        }


        // Map each sorted record back to one Row per keyword: (date, keyword, uv)
        JavaRDD<Row> resultRDD = sortedUvDateKeywordsRDD.flatMap(new FlatMapFunction<Tuple2<Long, String>, Row>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Iterable<Row> call(Tuple2<Long, String> tuple) throws Exception {
                String dateKeywords = tuple._2;
                String[] dateKeywordsSplit = dateKeywords.split(",");
                String date = dateKeywordsSplit[0];
                ArrayList<Row> rows = new ArrayList<Row>();
                rows.add(RowFactory.create(date, dateKeywordsSplit[1].split("_")[0],
                        Long.valueOf(dateKeywordsSplit[1].split("_")[1])));

                rows.add(RowFactory.create(date, dateKeywordsSplit[2].split("_")[0],
                        Long.valueOf(dateKeywordsSplit[2].split("_")[1])));

                rows.add(RowFactory.create(date, dateKeywordsSplit[3].split("_")[0],
                        Long.valueOf(dateKeywordsSplit[3].split("_")[1])));

                return rows;
            }
        });
        
        // Convert the final data to a DataFrame and save it to a Hive table
        DataFrame finalDF = sqlContext.createDataFrame(resultRDD, structType);
//        List<Row> rows1 = finalDF.javaRDD().collect();
//        for (Row row : rows1) {
//            System.out.println(row);
//        }
        finalDF.saveAsTable("daily_top3_keyword_uv");

        jsc.close();

    }
}
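The code above targets the Spark 1.x API (HiveContext, DataFrame, registerTempTable, saveAsTable), so it should be built against a matching Spark version. To run it, the class would typically be packaged into a jar and launched with spark-submit. A sketch, where the Spark home, hive-site.xml location, and jar path are placeholders for your own environment:

/usr/local/spark/bin/spark-submit \
  --class cn.spark.study.sql.DailyTop3Keyword \
  --files /usr/local/hive/conf/hive-site.xml \
  /usr/local/spark-study/spark-study.jar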

Source: www.cnblogs.com/weiyiming007/p/11319744.html