Flink writes to HDFS

Maven dependencies

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-filesystem_2.10</artifactId>
  <version>1.3.2</version>
</dependency>

Java code

DataStream<String> input = ...;

// When writing to a different cluster, include the prefix that identifies that cluster in the path
BucketingSink<String> sink = new BucketingSink<String>("/base/path");
sink.setBucketer(new DateTimeBucketer<String>("yyyy-MM-dd--HHmm"));
sink.setWriter(new StringWriter<>());
sink.setBatchSize(1024 * 1024 * 400); // this is 400 MB

input.addSink(sink);

Three properties matter most: Bucketer, Writer, and BatchSize.

Bucketer: decides which subdirectory of the HDFS base path each record is written to. DateTimeBucketer buckets by the current system time, and the granularity is set by the date format pattern passed to its constructor. You can also implement your own Bucketer, for example to choose the directory from a field in the data.
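To make the granularity point concrete, here is a minimal plain-Java sketch (no Flink dependency) of how a date pattern turns an instant into a bucket directory name. The class BucketGranularitySketch and the method bucketDir are hypothetical names used only for illustration, and the time zone is pinned to UTC so the output is reproducible; a real job would use whatever time zone the bucketer is configured with.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of Flink: shows how the date pattern
// determines the bucket directory for a given instant.
class BucketGranularitySketch {

    static String bucketDir(String basePath, String pattern, long epochMillis) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        // Assumption of this sketch: UTC, so that the result does not depend on the machine.
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return basePath + "/" + format.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        long t = 1504273500000L; // 2017-09-01 13:45:00 UTC
        // One directory per minute, per hour, and per day, respectively.
        System.out.println(bucketDir("/base/path", "yyyy-MM-dd--HHmm", t));
        System.out.println(bucketDir("/base/path", "yyyy-MM-dd--HH", t));
        System.out.println(bucketDir("/base/path", "yyyy-MM-dd", t));
    }
}
```

The finer the pattern, the more directories are created, so the pattern is effectively a trade-off between directory count and how precisely downstream jobs can select a time range.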

For example, this Bucketer chooses the directory from the Timestamp field of the JSON data:

class DateHourBucketer implements Bucketer<JSONObject> {
    private static final long serialVersionUID = 1L;
    // One bucket directory per hour
    private final SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd--HH");

    @Override
    public Path getBucketPath(Clock clock, Path basePath, JSONObject element) {
        // Use the record's own Timestamp field (epoch milliseconds) instead of the system clock
        Long timestamp = element.getLong("Timestamp");
        String newDateTimeString = format.format(new Date(timestamp));
        return new Path(basePath + "/" + newDateTimeString);
    }
}
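The practical effect of bucketing on a data field is that a record arriving late still lands in the directory of the hour it belongs to. The following plain-Java sketch illustrates this by grouping a few timestamps by their hour bucket. EventTimeBucketSketch, bucketFor, and groupByBucket are hypothetical names, and UTC is assumed so the result is reproducible; this is an illustration of the idea, not the Bucketer implementation itself.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: groups record timestamps by hour bucket.
class EventTimeBucketSketch {
    // Assumption of this sketch: buckets are computed in UTC.
    private static final DateTimeFormatter HOUR =
            DateTimeFormatter.ofPattern("yyyy-MM-dd--HH").withZone(ZoneOffset.UTC);

    static String bucketFor(long timestampMillis) {
        return HOUR.format(Instant.ofEpochMilli(timestampMillis));
    }

    static Map<String, List<Long>> groupByBucket(long[] timestamps) {
        Map<String, List<Long>> buckets = new TreeMap<>();
        for (long ts : timestamps) {
            buckets.computeIfAbsent(bucketFor(ts), k -> new ArrayList<>()).add(ts);
        }
        return buckets;
    }

    public static void main(String[] args) {
        long[] arrivalOrder = {
            1504273500000L, // 2017-09-01 13:45 UTC
            1504274400000L, // 2017-09-01 14:00 UTC
            1504271100000L  // 2017-09-01 13:05 UTC, arrives late
        };
        // The late 13:05 record is grouped with the 13:45 record, not with 14:00.
        System.out.println(groupByBucket(arrivalOrder));
    }
}
```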

Writer: the format in which records are written. The default is StringWriter, which writes each record as a line of text. If the data should be stored as Hadoop SequenceFiles, use SequenceFileWriter.

BatchSize: each parallel subtask writes its own part file. BatchSize is the part file size, in bytes, at which the sink closes the current part file and starts a new one.
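As a rough model of this rolling behaviour, the sketch below simulates one subtask writing records of given sizes and reports the size of each resulting part file. It is a simplification under a stated assumption: the size check happens before each write, so a part file can end up slightly larger than BatchSize. BatchSizeRollSketch and partFileSizes are hypothetical names, not Flink API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of size-based rolling for a single subtask.
class BatchSizeRollSketch {

    // Returns the size in bytes of each part file produced for the given records.
    static List<Long> partFileSizes(long[] recordSizes, long batchSize) {
        List<Long> parts = new ArrayList<>();
        long current = 0;
        for (long size : recordSizes) {
            // Assumption: roll before writing once the open file has passed the limit.
            if (current > batchSize) {
                parts.add(current);
                current = 0;
            }
            current += size;
        }
        if (current > 0) {
            parts.add(current); // the file still open at the end
        }
        return parts;
    }

    public static void main(String[] args) {
        long[] records = {40, 40, 40, 40, 40};
        // With a limit of 100 bytes, the first file grows to 120 before it is rolled.
        System.out.println(partFileSizes(records, 100));
    }
}
```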

You can also set the part file prefix and suffix, how long a file may stay inactive before its handle is closed, and other properties.

The default generated path format is as follows:

/base/path/{date-time}/part-{parallel-task}-{count}

count is the running number of the part file; it increases each time BatchSize causes a new part file to be started.
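The path layout above can be sketched as a simple string composition. PartPathSketch and partPath are hypothetical names, and the sketch assumes that both the subtask index and the count start at 0; it only mirrors the naming scheme and does not touch HDFS.

```java
// Hypothetical illustration of the part file naming scheme.
class PartPathSketch {

    static String partPath(String basePath, String bucket, int subtaskIndex, int count) {
        return basePath + "/" + bucket + "/part-" + subtaskIndex + "-" + count;
    }

    public static void main(String[] args) {
        String bucket = "2017-09-01--1345";
        // Two parallel subtasks, each having rolled once, give four part files.
        for (int subtask = 0; subtask < 2; subtask++) {
            for (int count = 0; count < 2; count++) {
                System.out.println(partPath("/base/path", bucket, subtask, count));
            }
        }
    }
}
```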

