Solved: inconsistent results when Spark SQL reads and writes a Hive table whose source data was loaded from CSV

Problem:

Hive query:

[Screenshot: Hive query result]

Spark SQL query:

[Screenshot: Spark SQL query result]

  • The same table returns different results depending on which engine runs the query.
  • The first row returned by Spark SQL is the header row of the source CSV. The null values appear because those columns are declared as int, so the header strings fail to parse (see the check after this list).
    [Screenshot: Spark SQL result showing the header row and nulls]
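
The nulls are easy to reproduce: a header token such as the string 'order_id' cannot be parsed as an int, and an invalid cast yields NULL in both Hive and Spark SQL. A minimal check (runnable in the hive shell or spark-sql; the literal is taken from the trains header):

-- an unparseable string cast to int returns NULL,
-- which is why every int column in the header row shows null
select cast('order_id' as int);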

Cause:

1. The header (and other dirty data) does not appear in Hive's results because the table definition skips the first line of the file:

create table trains(
order_id int
,product_id int
,add_to_cart_order int
,reordered int
)
row format delimited fields terminated by ','
lines terminated by '\n'
-- skip the first line of the file (the header)
tblproperties("skip.header.line.count"="1");
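
For context, a typical load step for the raw CSV would look like the following (the local path here is hypothetical):

-- load the raw file, header included; Hive skips line 1 at read time,
-- the header is still physically present in the table's data file
load data local inpath '/data/trains.csv' into table trains;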

2.

  • Hive ignores the header line because the table was created with the tblproperties("skip.header.line.count"="1") clause.
  • But this skip-header setting does not take effect when Spark reads the table (see the check after this list)! That is the cause of the inconsistency.
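
The property itself is visible from both engines; it is the read path that differs. A quick way to confirm the table metadata (table name from the create statement above):

-- shows skip.header.line.count = 1 in the table metadata;
-- Spark SQL still returned the header row despite it
show tblproperties trains;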

Solution

Solution 1:

Clean the dirty data with a shell command before loading the CSV into the Hive table.

  • Before loading, strip the first line from the original file and write the remaining rows to a new file, then load the table from the new file (a full sketch follows the command below):
sed '1d' tmp.csv > tmp_res.csv
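
A minimal sketch of the whole flow, assuming the target table is created without skip.header.line.count (the header is already gone from the file, so nothing should be skipped) and using a hypothetical path for the cleaned file:

-- step 1 (shell): sed '1d' tmp.csv > tmp_res.csv
-- step 2 (hive): load the cleaned, header-free file
load data local inpath '/tmp/tmp_res.csv' into table trains;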

Solution 2:

Create a backup table from the original table, and perform all Spark SQL reads and writes against the backup. Because create table ... as select materializes the rows Hive actually reads (with the header already skipped), the backup table's data files contain no header row, so Spark SQL returns the same rows as Hive.

create table if not exists orders_2
row format delimited fields terminated by "," 
as 
select * from orders;

Verification:

Hive query:

select * from orders_2 limit 5;

[Screenshot: Hive query result for orders_2]

Spark SQL query:

import spark.sql
sql("select * from test.orders_2 limit 10").show

[Screenshot: Spark SQL query result for orders_2]

Origin: blog.csdn.net/weixin_45666566/article/details/112680738