Spark SQL Exercise


Data acquisition: https://pan.baidu.com/s/1XcHKF50aEHrB_hPRfTb0vQ (extraction code: 76nd)

1. Environment requirements

  • Hadoop+Hive+Spark+HBase development environment.

2. Data description

1. Data background

  • The data is collected and summarized daily. It covers 180+ large agricultural product wholesale markets across the major provinces (excluding Hong Kong, Macao, Taiwan, Tibet, and Hainan) and 380+ agricultural product categories (because of seasonal and regional factors, a given day's data does not necessarily include every category).

2. Data type

  • Wholesale market price data of agricultural products: products.txt

    Field (column)                      Field name   Data type
    Agricultural product name (col 1)   name         String
    Wholesale price (col 2)             price        Double
    Acquisition time (col 3)            craw_time    String
    Wholesale market name (col 4)       market       String
    Province (col 5)                    province     String
    City (col 6)                        city         String

3. Functional requirements (each requirement must be implemented with both the RDD API and Spark SQL)

1. Statistics on the number of agricultural product markets

  • 1) Count the total number of agricultural product markets in each province

  • 2) List the provinces that have no agricultural product markets

2. Statistics on types of agricultural products

  • 1) Rank the provinces by number of distinct agricultural product types and list the top 3

  • 2) Within each province, list the top 3 agricultural product markets by number of distinct product types

3. Price statistics: calculate the price fluctuation trend of each agricultural product in Shanxi Province, that is, its average price for each day, and output the results to the console

  • The formula for calculating the average price of a certain agricultural product:
  • PAVG = (PM1 + PM2 + ... + PMn - max(P) - min(P)) / (N - 2)
  • Here P denotes a price and Mn a market (agricultural product wholesale market): PM1 is the product's price at market M1, max(P) is its highest price, min(P) its lowest, and N is the number of markets quoting the product (a short worked example follows).
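For example, with hypothetical prices of 2.0, 3.0, 4.0, and 5.0 from four markets, PAVG = (2.0 + 3.0 + 4.0 + 5.0 - 5.0 - 2.0) / (4 - 2) = 3.5. A minimal Scala sketch of the same trimmed mean (made-up values; when there are two or fewer quotes it falls back to a plain mean, as the solution code below also does):

val prices = Seq(2.0, 3.0, 4.0, 5.0)  // hypothetical quotes from four markets
val pAvg = if (prices.size > 2) (prices.sum - prices.max - prices.min) / (prices.size - 2) else prices.sum / prices.size
// pAvg == 3.5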


// case class describing one record of products.txt (the DataFrame below is built from an explicit schema instead)
case class Product(productName:String,price:String,craw_time:String,market:String,province:String,city:String)
// load the raw data as an RDD
val rdd = sc.textFile("file:///data/products.txt")
// define the schema
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
val schemaString = "productName price craw_time market province city"
val fields = schemaString.split("\\s+").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val rowRDD = rdd.map(_.split("\\s+")).filter(_.size==6).map(x => Row(x(0), x(1),x(2),x(3), x(4),x(5)))
val pdf = spark.createDataFrame(rowRDD, schema)
pdf.printSchema()
pdf.createOrReplaceTempView("products")
// 1. Agricultural product market counts
// 1) Total number of agricultural product markets in each province
// Spark SQL version
spark.sql("select province,count(distinct market) from products group by province").show
// RDD version
rdd.map(_.split("\\s+")).filter(_.size==6).map(x=>(x(4),x(3))).groupByKey.map(x=>(x._1,x._2.toArray.distinct.size)).collect
// 2) Which provinces have no agricultural product markets
case class Province(province:String,nickName:String)
val pdf2 = sc.textFile("file:///data/allprovinces.txt").map(_.split("\\s+")).map(x=>Province(x(0),x(1))).toDF
pdf2.createOrReplaceTempView("province")
spark.sql("with t1 as (select p1.province,p1.nickName,p2.market from province p1 left join products p2 on p1.province=p2.province) select * from t1 where t1.market is null ").show()
// 2. Agricultural product type statistics
// 1) Top 3 provinces by number of distinct agricultural product types
spark.sql("select province,count(distinct productName) c from products group by province order by c desc").show(3)
// 2) Top 3 agricultural product markets in each province, by number of distinct product types
spark.sql("with t1 as(select province,market,count(distinct productName) c from products group by province,market) ,t2 as(select t1.*,row_number() over(partition by province order by t1.c desc) as rank from t1) select t2.* from t2 where t2.rank<=3").show(10000)
/*
3. Price statistics: for Shanxi Province, compute the price trend of each agricultural
product, i.e. its average price per day, and print the results to the console.
Average price of a product:
PAVG = (PM1 + PM2 + ... + PMn - max(P) - min(P)) / (N - 2)
where P is a price and Mn is a market (agricultural product wholesale market); PM1 is
the product's price at market M1, max(P) the highest price and min(P) the lowest.
*/
spark.sql("select productName,if(count(1)>2,round((sum(price)-max(price)-min(price))/(count(1)-2)),round(sum(price)/count(1))) as avgPriceDay from products where province = '山西' group by productName").show(10000)


Source: blog.csdn.net/sun_0128/article/details/107964128