Hadoop learning summary: commonly used input and output formats

Purpose

This note summarizes the input and output formats commonly used in Hadoop.

Input Format

Hadoop can handle many different kinds of input, from ordinary text files to databases.

We start with a UML class diagram that covers the inheritance relationships among the common InputFormat classes and their important methods (some overloads are omitted).

DBInputFormat

  • DBInputFormat is an input format for reading input from a database. The KEY is a LongWritable holding the record number; the VALUE is a DBWritable, which you implement yourself according to the structure of the table.
  • To use it, call one of the setInput overloads: one takes the input class, the table name, the query conditions, the sort order, and the field list; the other takes the input class, the SQL query, and the SQL query that counts the records.
  • Its createDBRecordReader method returns the RecordReader that matches the database type found in the Configuration, such as OracleDBRecordReader or MySQLDBRecordReader.
  • Split logic: if the number of mappers is specified, the query result is divided into that many equal ranges, one per mapper (the last range extends to the end of the data, so it absorbs any remainder); if the number of mappers is not specified, there is a single split by default.
  • DataDrivenDBInputFormat is derived from it. As the name suggests, it is a data-driven database input format. The difference from DBInputFormat is that DataDrivenDBInputFormat controls splitting from the data itself: one column is designated as the reference, its boundaries are obtained (setBoundingQuery), and the resulting range is divided by the number of mappers.
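
The DBInputFormat split logic can be illustrated with a minimal sketch in plain Java. This is a simplified model, not Hadoop's own code: the class name DbSplitSketch and the method rowRanges are made up for illustration, and the ranges are assumed to be half-open row intervals [start, end).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of row-range splitting; names are hypothetical.
class DbSplitSketch {

    // Returns one [start, end) row range per mapper.
    static List<long[]> rowRanges(long rowCount, int numMappers) {
        // Assumption: an unspecified mapper count falls back to a single split.
        int chunks = Math.max(1, numMappers);
        long chunkSize = rowCount / chunks;
        List<long[]> ranges = new ArrayList<>();
        for (int i = 0; i < chunks; i++) {
            long start = i * chunkSize;
            // The last range runs to rowCount so leftover rows are not lost.
            long end = (i + 1 == chunks) ? rowCount : start + chunkSize;
            ranges.add(new long[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : rowRanges(10, 3)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

With 10 rows and 3 mappers, the first two ranges hold 3 rows each and the last one holds the remaining 4.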

FileInputFormat

  • FileInputFormat is the parent class of all file-based input formats. It implements the more general methods, such as getSplits, isSplitable, and listStatus.
  • By default, files can be split.
  • By default, hidden files in the input directory (files whose names begin with '.' or '_') are ignored.
  • By default, splits are cut by block size, with a split overflow factor of 1.1: the rest of a file is cut into another full-size split only while it is more than 1.1 times the split size.
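
To make the 1.1 overflow factor concrete, here is a small sketch in plain Java of how one file's length could be cut into splits. It is a simplified model rather than Hadoop's implementation; the names FileSplitSketch and splitLengths are hypothetical, and the split size is assumed to equal the block size.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of size-based splitting with an overflow factor.
class FileSplitSketch {
    static final double SPLIT_SLOP = 1.1; // overflow factor mentioned in the text

    // Returns the lengths of the splits produced for one file.
    static List<Long> splitLengths(long fileLength, long splitSize) {
        List<Long> lengths = new ArrayList<>();
        long remaining = fileLength;
        // Cut full-size splits only while the rest exceeds 1.1 x splitSize.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            remaining -= splitSize;
        }
        // Whatever is left (up to 1.1 x splitSize) becomes the last split.
        if (remaining != 0) {
            lengths.add(remaining);
        }
        return lengths;
    }

    public static void main(String[] args) {
        System.out.println(splitLengths(140, 128));
        System.out.println(splitLengths(150, 128));
    }
}
```

With a split size of 128, a file of length 140 stays in one split because 140/128 is below 1.1, while a file of length 150 is cut into 128 and 22. The factor avoids creating a tiny trailing split.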

TextInputFormat

  • TextInputFormat is the <LongWritable, Text> subclass of FileInputFormat: the key is the byte offset of the current line within the file, and the value is the content of the current line.
  • It overrides isSplitable, which uses the input file's extension to determine the compression codec in use and whether that codec supports splitting.
  • Its createRecordReader method returns a LineRecordReader, which implements the actual reading logic.
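
Note that the key is a byte offset, not a line number. The following sketch in plain Java shows what the <offset, line> records look like for a small input. It is only an illustration: LineOffsetSketch and records are made-up names, and the input is assumed to be UTF-8 text with '\n' line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrates <byte offset, line content> records; names are hypothetical.
class LineOffsetSketch {

    // Maps the starting byte offset of each line to the line text.
    static Map<Long, String> records(String content) {
        Map<Long, String> out = new LinkedHashMap<>();
        if (content.isEmpty()) {
            return out;
        }
        long offset = 0;
        for (String line : content.split("\n")) {
            out.put(offset, line);
            // Advance by the line's byte length plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        records("hello\nhadoop world\nbye\n")
            .forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

For the three lines in main, the keys are 0, 6, and 19.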

KeyValueTextInputFormat

  • KeyValueTextInputFormat is the <Text, Text> subclass of FileInputFormat: the key is the part of the current line to the left of the separator, and the value is the part to the right of it.
  • The separator can be customized with the property mapreduce.input.keyvaluelinerecordreader.key.value.separator; the default separator is a tab (\t).
  • If a line does not contain the separator, the key is the whole line and the value is empty.
  • It overrides isSplitable, which uses the input file's extension to determine the compression codec in use and whether that codec supports splitting (SplittableCompressionCodec).
  • Its createRecordReader method returns a KeyValueLineRecordReader, which implements the actual reading logic.
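
The key/value rule for a single line can be sketched in plain Java as follows. This is a minimal model, not the Hadoop record reader: KeyValueLineSketch and parse are hypothetical names, and the line is assumed to be cut at the first occurrence of the separator.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Models splitting one text line into a key and a value.
class KeyValueLineSketch {

    static Map.Entry<String, String> parse(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // No separator: the whole line is the key, the value is empty.
            return new SimpleEntry<>(line, "");
        }
        // Assumption: only the first separator divides key from value.
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> e = parse("user42\tclicked\thome", '\t');
        System.out.println("key=" + e.getKey() + " value=" + e.getValue());
    }
}
```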

NLineInputFormat

  • NLineInputFormat is a <LongWritable, Text> subclass of FileInputFormat: the key is the byte offset of the current line within the file, and the value is the content of the current line.
  • Split logic: the file is read line by line, and every N lines of input become one split (with a large amount of input, wouldn't this step be quite inefficient?!)
  • N is set either by calling setNumLinesPerSplit or through the property mapreduce.input.lineinputformat.linespermap.
  • Its createRecordReader method returns a LineRecordReader, which implements the actual reading logic.
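
The N-lines-per-split idea can be sketched in plain Java like this. It works on an in-memory list purely for illustration, whereas the real format scans the file to find line boundaries; NLineSplitSketch and group are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

// Groups lines into splits of at most n lines each; names are hypothetical.
class NLineSplitSketch {

    static List<List<String>> group(List<String> lines, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            int end = Math.min(i + n, lines.size());
            // Copy the window so each split owns its own list.
            splits.add(new ArrayList<>(lines.subList(i, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(group(List.of("a", "b", "c", "d", "e"), 2));
    }
}
```

Five lines with N = 2 give three splits, the last holding a single line. Finding these boundaries requires reading every line before any mapper starts, which is the cost the note above asks about.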

SequenceFileInputFormat

  • SequenceFileInputFormat is the subclass of FileInputFormat for SequenceFile input.
  • It overrides getFormatMinSplitSize to return 100k.
  • It overrides listStatus so that the SequenceFile inside a directory (a MapFile) can be found.
  • It has two common subclasses: SequenceFileAsTextInputFormat and SequenceFileAsBinaryInputFormat. The former behaves like the parent class but presents records as <Text, Text>; the latter presents them as <BytesWritable, BytesWritable>.

CombineFileInputFormat

  • CombineFileInputFormat is an abstract subclass of FileInputFormat that can pack multiple files into one split; it is used when the input consists of many small files. The existing implementation subclasses are CombineTextInputFormat and CombineSequenceFileInputFormat, for plain text files and SequenceFile input respectively.
  • Three variables govern splitting: maxSplitSize, minSplitSizeNode, and minSplitSizeRack. They must satisfy maxSplitSize >= minSplitSizeRack >= minSplitSizeNode.
  • Split logic: the input files are first sorted into pools according to the path filters that have been set, and splits are then created separately within each pool.
  • Splitting priority: node-local > rack-local > across racks. That is, the blocks in one split are preferably all on the same data node, then in the same rack, and only then spread over multiple racks.
  • How one pool is split: 1) Gather all blocks on the same node and cut them into splits of maxSplitSize until they no longer fill a whole split; a remainder of at least minSplitSizeNode also becomes a split, otherwise it is left as a "node tail". 2) Once every node in a rack has been processed as in 1), gather all the "node tails" of that rack and continue cutting by maxSplitSize; a remainder of at least minSplitSizeRack becomes a split, otherwise it is left as a "rack tail". 3) Once every rack has been processed as in 1) and 2), gather all the "rack tails" and continue cutting by maxSplitSize to the end; whatever is left becomes the final split.
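
The three-stage packing of one pool can be sketched in plain Java as below. This is a deliberately simplified model and not Hadoop's algorithm verbatim: it assumes each block lives on exactly one node (no replicas), represents a block only by its size, and closes a split as soon as its accumulated size reaches maxSplitSize. The names CombineSketch, pack, and combine are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified node -> rack -> cross-rack packing of block sizes.
class CombineSketch {

    // Packs blocks into splits of at least maxSize; returns the unpacked tail.
    static List<Long> pack(List<Long> blocks, long maxSize, long minTail,
                           List<List<Long>> splits) {
        List<Long> current = new ArrayList<>();
        long size = 0;
        for (long b : blocks) {
            current.add(b);
            size += b;
            if (size >= maxSize) {          // split is full, close it
                splits.add(current);
                current = new ArrayList<>();
                size = 0;
            }
        }
        if (!current.isEmpty() && size >= minTail) {
            splits.add(current);            // remainder is big enough on its own
            return new ArrayList<>();
        }
        return current;                     // too small: hand up as a "tail"
    }

    // racks: rack name -> (node name -> block sizes on that node)
    static List<List<Long>> combine(Map<String, Map<String, List<Long>>> racks,
                                    long maxSize, long minNode, long minRack) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> rackTails = new ArrayList<>();
        for (Map<String, List<Long>> nodes : racks.values()) {
            List<Long> nodeTails = new ArrayList<>();
            for (List<Long> blocks : nodes.values()) {
                nodeTails.addAll(pack(blocks, maxSize, minNode, splits)); // step 1
            }
            rackTails.addAll(pack(nodeTails, maxSize, minRack, splits));  // step 2
        }
        pack(rackTails, maxSize, 0, splits); // step 3: any remainder is kept
        return splits;
    }
}
```

For example, with maxSplitSize = 128, minSplitSizeNode = 50, and minSplitSizeRack = 60, a node holding blocks of 64, 64, and 30 yields one node-local split of 128 and a node tail of 30, which is then merged with tails from other nodes and racks.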

 

Output Format

As with input formats, Hadoop provides corresponding output formats. The class diagram of the common output formats is as follows:

DBOutputFormat

Writes the results to a database table. The static method setOutput sets the output table name and related information.

NullOutputFormat

An empty implementation of OutputFormat, i.e. one that produces no output at all.

LazyOutputFormat

A lazy output format: the output file is created only when output is actually produced.

FileOutputFormat

The abstract parent class for file-based output. It implements setting and getting the compression format, checking the output directory, and setting and getting the output path.

TextOutputFormat

An output format that writes the output to ordinary text files; each record is written as one text line of the form (key \t value).
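
The line layout can be illustrated with a few lines of plain Java. This is only a sketch of the (key \t value) form described above: TextLineSketch and format are hypothetical names, and special cases such as null keys or values are not modeled.

```java
// Formats one record as a text line: key, separator, value, newline.
class TextLineSketch {

    static String format(Object key, Object value, String separator) {
        // toString() is used for both sides, as for any text output.
        return key.toString() + separator + value.toString() + "\n";
    }

    public static void main(String[] args) {
        System.out.print(format("hadoop", 3, "\t"));
    }
}
```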

SequenceFileOutputFormat

Writes the output to SequenceFile files. Its subclass SequenceFileAsBinaryOutputFormat is dedicated to writing key/value pairs in binary format into the SequenceFile container.

 

Origin www.cnblogs.com/duanzi6/p/11599521.html