44. Window functions and a case study

I. Window functions

1. Overview

Starting with Spark 1.4.x, Spark SQL and DataFrame support window functions. The most classic and widely used one is row_number(), which lets us implement grouped top-N logic: within each group, number the rows by some ordering and keep only the first N. The general form is row_number() OVER (PARTITION BY ... ORDER BY ...).


Case: for each product category, find the top 3 products by sales revenue.


II. Grouped top-N case

1. Java implementation

package cn.spark.study.sql;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

/**
 * Hands-on with the row_number() window function
 * @author Administrator
 *
 */
public class RowNumberWindowFunction {

    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("RowNumberWindowFunction");
        JavaSparkContext sc = new JavaSparkContext(conf);
        HiveContext hiveContext = new HiveContext(sc.sc());
        
        // Create the sales revenue table: sales
        hiveContext.sql("DROP TABLE IF EXISTS sales");
        hiveContext.sql("CREATE TABLE IF NOT EXISTS sales ("
                + "product STRING,"
                + "category STRING,"
                + "revenue BIGINT)");  
        hiveContext.sql("LOAD DATA "
                + "LOCAL INPATH '/usr/local/spark-study/resources/sales.txt' "
                + "INTO TABLE sales"); 
        
        // Now write the statistical logic, using the row_number() window function
        // First, what row_number() does:
        // within each group of rows, it assigns a row number according to the group's sort order
        // for example, if the group date=20151001 contains three rows, 1122, 1121 and 1124,
        // then applying row_number() to this group, sorted descending, numbers the rows
        // 1124 -> 1, 1122 -> 2, 1121 -> 3; the row number starts at 1 within every group
        // Sub-query: the result of the inner query serves as a temporary table for the outer query
        
        DataFrame top3SalesDF = hiveContext.sql(""
                + "SELECT product,category,revenue "
                + "FROM ("
                    + "SELECT "
                        + "product,"
                        + "category,"
                        + "revenue,"
                        // Syntax of the row_number() window function:
                        // first, row_number() can be used directly in the SELECT clause;
                        // second, row_number() must be followed by the OVER keyword;
                        // inside the parentheses, PARTITION BY says which column to group by,
                        // and ORDER BY says how to sort the rows within each group;
                        // row_number() then gives every row its number inside its group
                        // tmp_sales: alias of the sub-query
                        + "ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) rank "
                    + "FROM sales "
                + ") tmp_sales "
                + "WHERE rank<=3");
        
        // Save the top 3 rows of each category into a table
        hiveContext.sql("DROP TABLE IF EXISTS top3_sales");  
        top3SalesDF.saveAsTable("top3_sales");  
        
        sc.close();
    }
    
}
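
As a side note, the same top-3-per-category logic can also be expressed with the DataFrame window-function API that arrived in the same Spark 1.4.x timeframe, without embedding SQL strings. The sketch below is not part of the original example; it assumes the Spark 1.4/1.5-era API, where the ranking function in org.apache.spark.sql.functions is still named rowNumber() (it was renamed row_number() in later releases):

package cn.spark.study.sql;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.hive.HiveContext;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.rowNumber;

public class RowNumberWindowFunctionDF {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("RowNumberWindowFunctionDF");
        JavaSparkContext sc = new JavaSparkContext(conf);
        HiveContext hiveContext = new HiveContext(sc.sc());

        // Window spec: partition rows by category, order each partition by revenue descending
        WindowSpec w = Window.partitionBy("category").orderBy(col("revenue").desc());

        // Number the rows within each category, then keep only the first 3 per category
        DataFrame top3SalesDF = hiveContext.table("sales")
                .withColumn("rank", rowNumber().over(w))
                .filter(col("rank").leq(3))
                .select("product", "category", "revenue");

        top3SalesDF.show();

        sc.close();
    }
}

Either way, the ranking is computed by the same underlying window evaluation; the SQL form in the main example is closer to how the query would be written in Hive itself.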

Origin www.cnblogs.com/weiyiming007/p/11307487.html