Newline problems when importing data into Hive: big data pitfalls

Problem Description

When migrating data into a big data environment, we usually use ETL tools such as Sqoop or DataX to move it from a database into Hive or HDFS. Text columns in the source database inevitably contain special characters such as line breaks, and these affect the data imported into Hive. The screenshot below shows data imported into Hive from MySQL, where the Hive table is stored in textfile format and special characters were not handled:
[Screenshot: query result of the Hive table imported from MySQL]

Two fields are missing from the row, so let's look at the Hive table's file on HDFS:
[Screenshot: the table's file on HDFS]
A single row of source data has become multiple lines: the text contains a '\n' line break, which misaligns the fields, so the Hive query shows two fields as missing.
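
The effect can be reproduced outside Hive with a minimal Python sketch. It assumes a text layout with one row per line and Ctrl-A (`\x01`, Hive's default field delimiter for text tables) between fields. The function names, the four-column row, and the sample values are hypothetical and only serve the illustration.

```python
FIELD_SEP = "\x01"  # assumed field delimiter (Hive's default for text tables)
ROW_SEP = "\n"      # rows are separated by newlines in a text file


def serialize(rows):
    """Write each row as one line, with no escaping of special characters."""
    return "".join(FIELD_SEP.join(row) + ROW_SEP for row in rows)


def parse(data, n_cols):
    """Read the file back line by line, padding missing fields with None."""
    lines = data.split(ROW_SEP)
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final row separator
    parsed = []
    for line in lines:
        fields = line.split(FIELD_SEP)
        fields += [None] * (n_cols - len(fields))
        parsed.append(fields[:n_cols])
    return parsed


# One source row whose second column contains an embedded line break.
source_rows = [("1", "first line\nsecond line", "2019-01-04", "ok")]
parsed_rows = parse(serialize(source_rows), n_cols=4)

for row in parsed_rows:
    print(row)
```

The single source row comes back as two rows: the first one has lost its last two fields, and the second one starts with the tail of the text column.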

Solution one

Replace the newline with an empty string. Sqoop has a parameter for this substitution, but DataX does not. In some scenarios, however, the newline must be kept to preserve the integrity of the data. What then?
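
For reference, the Sqoop options for Hive imports are `--hive-drop-import-delims` (drops `\n`, `\r`, and `\01` from string fields) and `--hive-delims-replacement` (replaces them with a string you choose). For tools without such an option, one possible workaround is to clean the text yourself before loading. The sketch below is only an illustration of that idea: `clean_field` is a hypothetical helper, not part of Sqoop or DataX.

```python
import re

# The same characters Sqoop treats as Hive delimiters: \n, \r and \x01.
_HIVE_DELIMS = re.compile(r"[\n\r\x01]")


def clean_field(value, replacement=""):
    """Remove (or replace) Hive delimiter characters in one text field."""
    if value is None:
        return None  # keep NULLs as they are
    return _HIVE_DELIMS.sub(replacement, value)


print(repr(clean_field("first line\nsecond line")))
print(repr(clean_field("first line\nsecond line", replacement=" ")))
```

Passing a replacement such as a space keeps word boundaries readable, at the cost of no longer being able to tell where the original line breaks were.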

Solution two

Store the Hive table as ORC.
ORC stands for Optimized Row Columnar. The ORC file format is a columnar storage format in the Hadoop ecosystem. It appeared in early 2013, originated in Apache Hive, and is meant to reduce Hadoop storage space and speed up Hive queries. Like Parquet, it is not a purely columnar format: the table is first split into row groups, and within each row group the data is stored by column. An ORC file is self-describing: its metadata is serialized with Protocol Buffers, and the data in the file is compressed as much as possible to reduce storage consumption. It is also supported by Spark SQL, Presto, and other query engines.
Because the storage is columnar rather than one row per text line, a line break inside a field no longer scrambles the data.
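
To see why a binary format is not affected, consider a simplified sketch in which a string column is stored as a list of byte lengths plus one concatenated data buffer, so no character is ever used as a separator. This is a toy model for illustration only, not ORC's actual on-disk layout; `encode_column` and `decode_column` are hypothetical names.

```python
def encode_column(values):
    """Store a string column as (lengths, data) instead of delimited text."""
    encoded = [v.encode("utf-8") for v in values]
    lengths = [len(b) for b in encoded]
    data = b"".join(encoded)
    return lengths, data


def decode_column(lengths, data):
    """Cut the buffer back into values using the recorded byte lengths."""
    values, pos = [], 0
    for n in lengths:
        values.append(data[pos:pos + n].decode("utf-8"))
        pos += n
    return values


column = ["first line\nsecond line", "plain text"]
lengths, data = encode_column(column)
restored = decode_column(lengths, data)
print(restored == column)
```

Since value boundaries come from the lengths and not from the content, the embedded newline is just another byte in the buffer.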

After storing the table as ORC, run a full-table query:
[Screenshot: full-table query result]
Huh? Why are there empty values? Why are the following lines all null? That should not happen with columnar storage!
Don't worry: what you see is not necessarily the truth. Add a filter condition on the id of this row and try again:
[Screenshot: query result filtered by id]
Huh? Now the data is there?
Explanation: I was using the Hive query interface that ships with Ambari. This may be a bug in Ambari itself: it renders the line breaks inside the text data, so what you see looks scrambled. The mess is only on the page; the actual data is intact, and queries against it run without any problem. It is a page display issue, and the data itself is no longer a problem!

Origin blog.csdn.net/u013289115/article/details/85775699