SparkSQL based on Hadoop

1. Understand Spark Shell and Spark SQL mode

2. Learn to use Spark Shell and Spark SQL mode to create tables and query data

Experimental principle

The predecessor of Spark SQL is Shark, a component of the Spark ecosystem developed at Berkeley's AMPLab. Shark runs SQL on the Spark engine and speeds up SQL queries by a factor of 10-100. However, as Spark evolved, Shark's heavy dependence on Hive (it reuses Hive's syntax parser, query optimizer, and so on) conflicted with Spark's established "One Stack to Rule Them All" strategy and hindered the integration of Spark's components, so the Spark SQL project was proposed.

Spark SQL abandoned the original Shark code but absorbed some of Shark's strengths, such as in-memory columnar storage (In-Memory Columnar Storage) and Hive compatibility, and was redeveloped from scratch. Freed from the dependence on Hive, Spark SQL gained much greater flexibility in Hive compatibility, performance optimization, and component extension.

(Figure: Spark SQL execution process)

The specific execution process of SQLContext is as follows:

(1) The SQL or HQL statement is parsed by SqlParser into an Unresolved Logical Plan.

(2) The Analyzer binds the plan against the data dictionary (Catalog), producing a Resolved Logical Plan. In this process the Catalog resolves the SchemaRDDs, i.e. the case-class-like objects that have been registered as in-memory tables.

(3) The Catalyst Optimizer optimizes the Analyzed Logical Plan into an Optimized Logical Plan. The steps after this point are carried out inside Spark core.

(4) The Optimized Logical Plan is handed to the SparkPlanner, which maps it to physical operators; this step generates a Spark Plan.

(5) The Spark Plan converts the Logical Plan into a Physical Plan.

(6) prepareForExecution() converts the Physical Plan into an executable physical plan.

(7) execute() runs the executable physical plan.

(8) The result is generated as a DataFrame.

The whole process involves several Spark SQL components, such as SqlParser, Analyzer, Optimizer, and SparkPlan.
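
These plan stages can be inspected directly from spark-shell. The sketch below assumes the sqlContext and the orders temporary table created in step 6 of this experiment already exist; explain(true) is the standard DataFrame method in Spark 1.6 that prints the parsed, analyzed, and optimized logical plans and the physical plan in turn.

    // Minimal sketch (Spark 1.6): print the plan stages for a query.
    // Assumes the sqlContext and the "orders" temporary table from step 6
    // below have already been created in the current spark-shell session.
    // explain(true) prints the Parsed / Analyzed / Optimized Logical Plans
    // and the Physical Plan, i.e. the stages (1)-(8) described above.
    sqlContext.sql("select order_id, buyer_id from orders").explain(true)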

Lab environment

Linux Ubuntu 16.04

jdk-7u75-linux-x64

hadoop-2.6.0-cdh5.4.5

scala-2.10.5

spark-1.6.0-bin-hadoop2.6

Experiment content

An e-commerce platform needs to analyze its order data. The order data consists of two files: the order file orders and the order detail file order_items. orders records the order ID, order number, buyer ID, and order date for the goods a user purchased. order_items records the detail (item) ID, order ID, and goods ID. Their structure and relationship are described below:
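
(The original figure is not reproduced here. The following is a sketch of the assumed structure, reconstructed from the table definitions used later in steps 6, 7, and 11, with the two tables joined on order_id.)

    // Assumed schema sketch; field names follow the code later in this experiment.
    case class Orders(order_id: String, order_number: String, buyer_id: String, create_dt: String)
    case class OrderItems(item_id: String, order_id: String, goods_id: String)
    // Relationship: order_items.order_id references orders.order_id
    // (one order can contain many order items).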

Experimental steps

1. First check whether the Hadoop-related processes have been started. If they are not running, switch to the /apps/hadoop/sbin directory and start Hadoop.

    jps
    cd /apps/hadoop/sbin
    ./start-all.sh

2. Create a new /data/spark5 directory locally on Linux.

    mkdir -p /data/spark5

3. Switch to the /data/spark5 directory and use the wget command to download the orders and order_items files from http://172.16.103.12:60000/allfiles/spark5.

    cd /data/spark5
    wget http://172.16.103.12:60000/allfiles/spark5/orders
    wget http://172.16.103.12:60000/allfiles/spark5/order_items

4. First, create a new /myspark5 directory on HDFS, and then upload the orders and order_items files in the /data/spark5 directory to the /myspark5 directory of HDFS.

    hadoop fs -mkdir /myspark5
    hadoop fs -put /data/spark5/orders /myspark5
    hadoop fs -put /data/spark5/order_items /myspark5

5. Start Spark Shell.

    spark-shell

6. In spark-shell, use the case class method to define the schema of the RDD and create the orders table.

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    case class Orders(order_id:String,order_number:String,buyer_id:String,create_dt:String)
    val dforders = sc.textFile("/myspark5/orders").map(_.split('\t')).map(line=>Orders(line(0),line(1),line(2),line(3))).toDF()
    dforders.registerTempTable("orders")

Verify that the table was created successfully.

    sqlContext.sql("show tables").map(t=>"tableName is:"+t(0)).collect().foreach(println)
    sqlContext.sql("select order_id,buyer_id from orders").collect

7. In spark-shell, use the applySchema method to define the schema of the RDD and create the order_items table.

    import org.apache.spark.sql._
    import org.apache.spark.sql.types._
    val rddorder_items = sc.textFile("/myspark5/order_items")
    val roworder_items = rddorder_items.map(_.split("\t")).map( p=>Row(p(0),p(1),p(2)) )
    val schemaorder_items = "item_id order_id goods_id"
    val schema = StructType(schemaorder_items.split(" ").map(fieldName=>StructField(fieldName,StringType,true)))
    val dforder_items = sqlContext.applySchema(roworder_items, schema)
    dforder_items.registerTempTable("order_items")

Verify that table creation is successful.

    sqlContext.sql("show tables").map(t=>"tableName is:"+t(0)).collect().foreach(println)
    sqlContext.sql("select order_id,goods_id from order_items").collect
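
Note: applySchema is the older API; since Spark 1.3 it has been deprecated in favor of createDataFrame, which takes the same arguments. A minimal equivalent sketch, reusing the roworder_items RDD and schema defined above:

    // Equivalent to the applySchema call above (applySchema still works in
    // Spark 1.6 but is deprecated in favor of createDataFrame).
    val dforder_items2 = sqlContext.createDataFrame(roworder_items, schema)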

8. Join the orders table and the order_items table to find out which users of the e-commerce website purchased which products.

    sqlContext.sql("select orders.buyer_id, order_items.goods_id from order_items join orders on order_items.order_id=orders.order_id").collect

9. Exit Spark Shell mode.

    exit

The following steps demonstrate Spark SQL mode.

10. Start Spark SQL.

    spark-sql

11. Create the tables orders and order_items.

    create table orders (order_id string,order_number string,buyer_id string,create_dt string)
    row format delimited fields terminated by '\t' stored as textfile;

    create table order_items(item_id string,order_id string,goods_id string)
    row format delimited fields terminated by '\t' stored as textfile;

12. View the created tables.

    show tables;

The false shown after each table name indicates that the table is not a temporary table.

13. Load the orders and order_items data files under /myspark5 on HDFS into the two tables just created.

    load data inpath '/myspark5/orders' into table orders;
    load data inpath '/myspark5/order_items' into table order_items;

14. Verify that the data was loaded successfully.

    select * from orders;
    select * from order_items;

15. Join the orders table and the order_items table to see which users of the e-commerce website purchased which products.

    select orders.buyer_id, order_items.goods_id from order_items join orders on order_items.order_id=orders.order_id;
