problem:
Hive query:
Spark SQL query:
- Querying the same table gives different results in Hive and in Spark SQL.
- The first row returned by Spark SQL is the header line of the source data file. The null values appear because those columns are declared as int, so the header strings fail the type conversion and come back as null.
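To confirm that the stray row really is the file header, it is enough to inspect the first line of the source file. A minimal sketch, assuming a hypothetical tmp.csv with the same column layout as the table:

```shell
# Hypothetical sample standing in for the real source file (column names assumed)
printf 'order_id,product_id,add_to_cart_order,reordered\n1,49302,1,1\n' > tmp.csv

# Print the header line, i.e. the row that Spark SQL returns as data
head -n 1 tmp.csv
```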
the reason:
1. The header (dirty data) does not show up in the Hive table because the first line of the file was skipped when the table was created:
create table trains(
order_id int
,product_id int
,add_to_cart_order int
,reordered int
)
row format delimited fields terminated by ','
lines terminated by '\n'
-- skip the first line of the file
tblproperties("skip.header.line.count"="1");
2.
- Hive can ignore the first line by adding the tblproperties("skip.header.line.count"="1") clause when creating the table.
- But the header skip configured in Hive does not take effect in Spark SQL. This is the cause of the discrepancy.
solve
Solution 1:
Clean the dirty data with a shell command before loading the CSV into the Hive table.
- Before loading, strip the first line from the original file, write the remaining lines to a new file, and then load the table from that new file:
sed '1d' tmp.csv > tmp_res.csv
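A quick end-to-end check of the sed command above, run on a hypothetical tmp.csv (the file name and sample rows are assumptions):

```shell
# Hypothetical sample file: one header line plus two data rows
printf 'order_id,product_id,add_to_cart_order,reordered\n1,49302,1,1\n1,11109,2,1\n' > tmp.csv

# '1d' deletes line 1 (the header); the remaining lines go to the new file
sed '1d' tmp.csv > tmp_res.csv

# The cleaned file now starts with real data and is one line shorter
head -n 1 tmp_res.csv
wc -l < tmp_res.csv
```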
Solution 2:
Create a backup table from the original table, and run the Spark SQL reads and writes against the backup. Because Hive honors skip.header.line.count when reading the original table, the backup table's data files no longer contain the header line, so Spark SQL reads it cleanly.
create table if not exists orders_2
row format delimited fields terminated by ","
as
select * from orders;
verification:
Hive query:
select * from orders_2 limit 5;
Spark SQL query:
import spark.sql
sql("select * from test.orders_2 limit 10").show