02 - Hive / Hadoop data storage formats and creating a Hive table from Avro files



I. Storage formats
1. Hadoop file formats
1> SequenceFile
   SequenceFile is a binary file format provided by the Hadoop API that serializes data as <key, value> pairs. Internally it uses Hadoop's standard Writable interface for serialization and deserialization, and it is compatible with MapFile in the Hadoop API. Hive's SequenceFile inherits from the Hadoop API SequenceFile, but its key is empty and the actual row is stored in the value, which avoids the sorting carried out in the map stage of a MapReduce job. If you write a SequenceFile with the Java API and want Hive to read it, be sure to store the row data in the value field; otherwise you will have to write custom InputFormat and OutputFormat classes for that SequenceFile.
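   A minimal sketch of declaring a SequenceFile-backed table in Hive (table and column names are made up for illustration):

        -- Hive writes each row into the value of the <key, value> pair; the key stays empty
        CREATE TABLE demo_seq (id INT, name STRING)
        STORED AS SEQUENCEFILE;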
2> RCFile
   RCFile is a special column-oriented data format introduced by Hive. It follows the design philosophy of "partition horizontally by rows first, then split vertically by columns". When a query does not care about a column, it skips the IO for that column. It is worth mentioning that in the map stage what gets copied from the remote end is still the whole data block; after the copy reaches the local directory, RCFile does not truly skip the unneeded columns and jump straight to the columns to be read. Column skipping is achieved by scanning the header of each row group, but at the level of a whole HDFS block the header does not record in which row group each column starts and ends. As a result, when all columns are read, RCFile's performance is not higher than SequenceFile's.
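   A sketch, with hypothetical names, of an RCFile-backed table and a query that projects a single column; the other columns are skipped within each row group, though every row group header is still scanned:

        CREATE TABLE demo_rc (id INT, name STRING, price DOUBLE)
        STORED AS RCFILE;

        SELECT name FROM demo_rc;   -- column-pruned read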
3> Avro
  Avro is a binary file format designed for data-intensive workloads. Its files are more compact, and when reading large amounts of data Avro can deliver better serialization and deserialization performance. Avro data files carry their schema definition with them, so developers do not need to implement their own Writable objects at the API level. More and more Hadoop subprojects now support the Avro data format, such as Pig, Hive, Flume, Sqoop and HCatalog.
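  Because the schema travels with the data, a Hive table can embed it directly. A sketch (schema and names are illustrative; STORED AS AVRO is the Hive 0.14+ shorthand for the Avro container input/output formats used later in this post):

        CREATE TABLE demo_avro
        STORED AS AVRO
        TBLPROPERTIES ('avro.schema.literal'='{
            "type": "record", "name": "Order",
            "fields": [
                {"name": "id",     "type": "long"},
                {"name": "amount", "type": "double"}
            ]
        }');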
4> Text
  Besides the three binary formats above, text-format data is also frequently encountered in Hadoop, for example TextFile, XML and JSON. Text formats not only take up more disk space, their parsing overhead is generally several times higher than that of binary formats; this is especially true of XML and JSON, whose parsing is even more expensive than TextFile's, so using these formats for storage in a production system is strongly discouraged. If such formats must be produced, do the conversion on the client side. Text formats are commonly used for log collection and database imports; Hive's default configuration is the text format, and it is easy to forget to enable compression, so make sure the format is used correctly. Another drawback of text formats is that they carry no type model: numeric data such as sales amounts or profits, and date/time values, are stored as strings of varying length, possibly with negative signs, so MapReduce cannot sort them correctly as numbers. They therefore often have to be preprocessed into a binary format with a schema, which adds the overhead of a preprocessing step and wastes storage resources.
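  A sketch of an explicitly text-backed table, plus the session settings that are easy to forget when writing compressed text output (table name and codec choice are just examples):

        CREATE TABLE demo_text (id INT, sold_at STRING, amount STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        STORED AS TEXTFILE;

        -- compress the files produced by queries that write into text tables
        SET hive.exec.compress.output=true;
        SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;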
  5> External formats
  Hadoop can support virtually any file format, as long as RecordWriter and RecordReader are implemented for it. Database formats are also often stored in Hadoop, such as HBase, MySQL, Cassandra and MongoDB. These formats are typically used to avoid moving large volumes of data and to meet fast-loading needs. Their serialization and deserialization are handled by the database client, the storage location and data layout are not controlled by Hadoop, and their files are not split into blocks according to the HDFS block size (BLOCKSIZE).
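  For example, Hive can map an HBase table through a storage handler instead of an HDFS file format; a sketch with hypothetical table and column-family names:

        CREATE EXTERNAL TABLE hbase_demo (rowkey STRING, val STRING)
        STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
        WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf1:val')
        TBLPROPERTIES ('hbase.table.name' = 'demo');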
2. Hive file formats
    TEXTFILE     // plain text, the default
    SEQUENCEFILE // binary sequence file
    RCFILE       // columnar storage format, supported since Hive 0.6
    ORC          // columnar storage format with a higher compression ratio and better read/write efficiency than RCFILE, supported since Hive 0.11
    PARQUET      // columnar storage format, supported since Hive 0.13
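    For reference, a sketch of the STORED AS clause for the newer columnar formats (table names are made up):

        CREATE TABLE demo_orc     (id INT, name STRING) STORED AS ORC;       -- Hive 0.11+
        CREATE TABLE demo_parquet (id INT, name STRING) STORED AS PARQUET;   -- Hive 0.13+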


II. Loading data from Avro files into a Hive table
    1. Generate, from the source .avro data file, the .avsc schema file that describes the structure of the Hive table to be created (the schema file can also be written by hand).
        (To generate the .avsc file, download avro-tools-1.8.2.jar and run it with java -jar, for example: java -jar avro-tools-1.8.2.jar getschema xxxx/part-m-00000.avro > ~/xxx.avsc)

     Download link: http://apache.mirrors.tds.net/avro/avro-1.8.2/java/avro-tools-1.8.2.jar
    2. Create the Hive table from the .avsc file
        CREATE EXTERNAL TABLE database_name.table_name
        COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
        STORED AS
        INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
        LOCATION '/data/db_biz/history/item'   -- the corresponding HDFS data directory
        TBLPROPERTIES (
            'avro.schema.url'='hdfs:///data/orders.avsc'   -- the .avsc file that defines the table structure
        );
    3. Upload the Avro file to the specified directory
        hadoop fs -put /home/hadoop/xxx.avro hdfs://hadoop:9000/data/db_biz/history/item/
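        Alternatively, if the file is already staged somewhere else on HDFS, a LOAD DATA statement moves it into the table's location (the staging path here is illustrative):

            LOAD DATA INPATH '/tmp/xxx.avro' INTO TABLE analysis.ods_Item;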
    
    4. Query and verify
        Table structure: describe analysis.ods_Item;
        Table data: select count(*) from analysis.ods_Item;

 

Origin blog.csdn.net/qq_35281775/article/details/89853219