Flink: addSource & fromSource, addSink & sinkTo

1. addSource & fromSource, addSink & sinkTo
       The difference between these two pairs of operators is that addSource and addSink require you to implement SourceFunction or SinkFunction yourself, including the logic for reading data, fault tolerance, and so on; fromSource and sinkTo take the ready-made source and sink connectors provided by Flink. It is recommended to prefer fromSource and sinkTo, in combination with the official Flink documentation.
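To make the contrast concrete, a minimal sketch (assuming an existing StreamExecutionEnvironment named env and a FileSource named fileSource like the one built in section 2 below; all names are illustrative):
// Old style: addSource takes a SourceFunction that you implement yourself
DataStreamSource<String> legacy = env.addSource(new SourceFunction<String>() {
    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        ctx.collect("hello");   // reading logic, fault tolerance, etc. are entirely up to you
    }

    @Override
    public void cancel() { }
});

// New style: fromSource takes a ready-made Source plus a WatermarkStrategy and a name
DataStreamSource<String> modern = env.fromSource(
        fileSource, WatermarkStrategy.noWatermarks(), "file-source");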
2. filesystem source operator
1. readTextFile(filePath: String, charsetName: String): under the hood this calls readFile(format, filePath, FileProcessingMode.PROCESS_ONCE, -1, typeInfo)
2. readFile(FileInputFormat<OUT> inputFormat, String filePath, FileProcessingMode watchType, long interval, FilePathFilter filter) (a usage sketch follows the parameter list)
①FileInputFormat<OUT> inputFormat: the class used to read the file; look at the available implementation classes and pick one according to the type of file you need to read, or write a custom one;
②String filePath: file path
③FileProcessingMode watchType: the mode for reading the path; read once: FileProcessingMode.PROCESS_ONCE, read continuously: FileProcessingMode.PROCESS_CONTINUOUSLY
④long interval: the interval at which the path is re-scanned, in milliseconds; set it to -1 for batch (read-once) mode
⑤FilePathFilter filter: filters the files to read; marked @Deprecated
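Putting the parameters together, a usage sketch (the path and interval are invented, and the deprecated filter argument is omitted by using the overload without it):
DataStream<String> lines = env.readFile(
        new TextInputFormat(new Path("/tmp/in")),   // one of the ready-made FileInputFormat implementations
        "/tmp/in",
        FileProcessingMode.PROCESS_CONTINUOUSLY,    // keep re-scanning the path
        1000L);                                     // every 1000 ms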
Both of the sources above have a parallelism of 1 in the underlying source code, and both are SourceFunctions passed in via addSource. As an aside, before 1.14 the Flink Kafka connector also used addSource, implementing ParallelSourceFunction plus some fault-tolerance classes; the fromSource adopted since the 1.14 release uses the new architecture of splits (Split), a split enumerator (SplitEnumerator) and a source reader (SourceReader).
3. FileSource
A. Batch mode
val fileSource: FileSource[String] = FileSource
  .forBulkFileFormat(/* a BulkFormat plus the input path(s) */)
  .build()   // no monitorContinuously(), so the path is read once and the source is bounded
B. Streaming mode
Dependency:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-files</artifactId>
<version>1.12.5</version>
</dependency>
Code:
val fileSource: FileSource[String] = FileSource
  .forRecordStreamFormat(new TextLineFormat(), new Path(""))
  .monitorContinuously(Duration.ofSeconds(1000))
  .build()
This approach can read files in parallel and is also fault tolerant; see the official Flink code for the specific logic. Note that if you use TextLineFormat, the isSplittable method of its parent class SimpleStreamFormat returns false, so even if the fromSource operator has a parallelism greater than one, a single file still cannot be read in parallel; you can, however, implement StreamFormat yourself. The following is a simple implementation based on TextLineFormat.
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.file.src.reader.StreamFormat;
import org.apache.flink.connector.file.src.reader.TextLineFormat;
import org.apache.flink.core.fs.FSDataInputStream;

import javax.annotation.Nullable;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;


public class MyStreamFormat implements StreamFormat<String> {
    private static final long serialVersionUID = 1L;

    public static final String DEFAULT_CHARSET_NAME = "UTF-8";

    private final String charsetName;

    public MyStreamFormat() {
        this(DEFAULT_CHARSET_NAME);
    }

    public MyStreamFormat(String charsetName) {
        this.charsetName = charsetName;
    }

    @Override
    public Reader createReader(Configuration config, FSDataInputStream stream, long fileLen, long splitEnd) throws IOException {
        final BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, charsetName));
        return new MyStreamFormat.Reader(reader);
    }

    @Override
    public Reader restoreReader(Configuration config, FSDataInputStream stream, long restoredOffset, long fileLen, long splitEnd) throws IOException {
        stream.seek(restoredOffset);
        return createReader(config, stream,fileLen,splitEnd);
    }

    @Override
    public boolean isSplittable() {
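        // Returning true lets the framework split a single file among parallel readers.
        // Caveat: this simple line reader does not align record boundaries with the split edges
        // (which is why Flink's own TextLineFormat returns false here), so treat it as a sketch
        // rather than a production-ready splittable format.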
        return true;
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return Types.STRING;
    }

    public static final class Reader implements StreamFormat.Reader<String> {

        private final BufferedReader reader;

        Reader(final BufferedReader reader) {
            this.reader = reader;
        }

        @Nullable
        @Override
        public String read() throws IOException {
            return reader.readLine();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }
    }
}
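With this format in place, the source from part B can be built like this (a sketch; path, interval and parallelism are invented):
FileSource<String> source = FileSource
        .forRecordStreamFormat(new MyStreamFormat(), new Path("/tmp/in"))
        .monitorContinuously(Duration.ofSeconds(10))
        .build();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "my-file-source")
        .setParallelism(4);   // with isSplittable() == true, even a single large file can be split across readers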

3. filesystem sink operator
StreamingFileSink (FileSink from version 1.14 on); the former is used as the example here
Two write modes: forRowFormat and forBulkFormat
1. Construction of forRowFormat and forBulkFormat
①forRowFormat(final Path basePath, final Encoder<IN> encoder)
In row mode, customization is limited to what is written inside the file; operations such as compressing the file are difficult;
//This interface has only one method
public interface Encoder<IN> extends Serializable {
    void encode(IN element, OutputStream stream) throws IOException;
}
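For example, a minimal row-mode sink using the ready-made SimpleStringEncoder (a sketch; the output path and the upstream DataStream named stream are invented):
StreamingFileSink<String> rowSink = StreamingFileSink
        .forRowFormat(new Path("/tmp/out"), new SimpleStringEncoder<String>("UTF-8"))
        .build();
stream.addSink(rowSink);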
②forBulkFormat( final Path basePath, final BulkWriter.Factory<IN> bulkWriterFactory)
In column (bulk) mode, you can not only control what goes inside the file but also conveniently apply file compression and similar operations;
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;
import org.apache.flink.api.common.serialization.BulkWriter;
import org.apache.flink.core.fs.FSDataOutputStream;

import java.io.IOException;

public class Mybucket<T> implements BulkWriter<T> {

    FSDataOutputStream fsDataOutputStream = null;
    GzipCompressorOutputStream gzout = null;

    //Each element is written to the file through the gzip stream
    @Override
    public void addElement(T element) throws IOException {
        gzout.write(element.toString().getBytes());
    }

    //Flush the stream. If efficiency is a concern you can skip flushing here and only flush in finish().
    @Override
    public void flush() throws IOException {
        gzout.flush();
    }

    //Finish writing.
    //Note: this method must NOT close the stream passed in by the Factory -- that is done by the framework!!!
    //finish() writes the gzip trailer without closing the underlying FSDataOutputStream (close() would close it too).
    @Override
    public void finish() throws IOException {
        gzout.finish();
    }

    //Creates the writer.
    //Writing this as a separate top-level class is also fine; the enclosing class would just need one more constructor.
    class MyFactory implements Factory<T> {
        @Override
        public Mybucket<T> create(FSDataOutputStream out) throws IOException {
            fsDataOutputStream = out;
            gzout = new GzipCompressorOutputStream(out);
            return Mybucket.this;
        }
    }
}
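Wiring this writer into the sink could then look like the sketch below (the output path is invented; also note that BulkWriter.Factory must be serializable, so in a real job Mybucket should additionally implement java.io.Serializable):
StreamingFileSink<String> bulkSink = StreamingFileSink
        .forBulkFormat(new Path("/tmp/out"), new Mybucket<String>().new MyFactory())
        .build();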
2. withBucketAssigner():
①Explanation: specifies the bucketing strategy; the so-called bucket is the folder a given record should go into. Shared by row mode and column mode;
②Parameter: BucketAssigner, for which there are three options:
BasePathBucketAssigner:
//Generate files directly under the given path, no folder will be generated
public String getBucketId(T element, BucketAssigner.Context context) {return "";}
DateTimeBucketAssigner:
//The date format (i.e. the bucket granularity) and the time zone can be configured via the constructors
//Generated folder name format: yyyy-MM-dd--HH, so by default buckets roll hourly
public String getBucketId(IN element, BucketAssigner.Context context) {
        if (dateTimeFormatter == null) {
            dateTimeFormatter = DateTimeFormatter.ofPattern(formatString).withZone(zoneId);
        }
        return dateTimeFormatter.format(Instant.ofEpochMilli(context.currentProcessingTime()));
    }
Custom BucketAssigner: which one to extend depends on your business logic. If you only bucket by the data itself, you can extend BasePathBucketAssigner and override getBucketId; if you need custom time handling, extend DateTimeBucketAssigner and override getBucketId; and if the logic is more complicated than that, implement BucketAssigner directly. A brief look at the two methods of BucketAssigner (a custom example follows the method list):
//Determines which folder each record goes to; if the folder does not exist it is created automatically. The final write path is the given base path plus the return value of this method.
//The context gives access to the element timestamp, the current watermark and the current processing time
BucketID getBucketId(IN element, BucketAssigner.Context context);
//Serializer for the bucket id returned by getBucketId
SimpleVersionedSerializer<BucketID> getSerializer();
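For instance, bucketing by a field of the record itself could look like this (a sketch; the MyEvent type and its getKey() accessor are made up):
public class KeyBucketAssigner extends BasePathBucketAssigner<MyEvent> {
    @Override
    public String getBucketId(MyEvent element, BucketAssigner.Context context) {
        // each distinct key gets its own folder under the base path
        return element.getKey();
    }
}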
3. withRollingPolicy:
①Explanation: the RollingPolicy defines when to close a given in-progress part file, move it to the pending state, and later to the finished state. Files in the finished state are available for reading, are guaranteed to contain valid data, and will not be rolled back in the event of a failure. In STREAMING mode, the rolling policy together with the checkpoint interval (pending files only become finished on the next successful checkpoint) controls when part files become visible to downstream readers as well as their size and number. In BATCH mode, part files only become visible to downstream consumers at the end of the job, and the rolling policy only controls the maximum part file size. Column mode (forBulkFormat) can only use a CheckpointRollingPolicy;
②Parameters
RollingPolicy (the parent interface of all rolling policies)
public interface RollingPolicy<IN, BucketID> extends Serializable {
    //Determines if the in-progress part file for a bucket should roll on every checkpoint.(if true in-progress move to pending)
    //The change of file from pending state to finished state has nothing to do with this
    boolean shouldRollOnCheckpoint(final PartFileInfo<BucketID> partFileState) throws IOException;
    //Determines if the in-progress part file for a bucket should roll based on its current state, e.g. its size.(if true in-progress move to pending)
    boolean shouldRollOnEvent(final PartFileInfo<BucketID> partFileState, IN element)
            throws IOException;
    // Determines if the in-progress part file for a bucket should roll based on a time condition.(if true in-progress move to pending)
    boolean shouldRollOnProcessingTime(
            final PartFileInfo<BucketID> partFileState, final long currentTime) throws IOException;
}
CheckpointRollingPolicy
Although column mode can only use the abstract class CheckpointRollingPolicy (an implementation of RollingPolicy that overrides shouldRollOnCheckpoint to return true), it has only one built-in subclass, OnCheckpointRollingPolicy (whose shouldRollOnEvent and shouldRollOnProcessingTime return false). If, in column mode, you do not want files to roll on every checkpoint, you can try extending CheckpointRollingPolicy and overriding shouldRollOnCheckpoint to return false. (A usage sketch for both modes follows the DefaultRollingPolicy builder below.)
DefaultRollingPolicy, built via the builder pattern
DefaultRollingPolicy.builder()
          //Roll the file once it has been open longer than this interval
          .withRolloverInterval(Duration.ofMinutes(15))
          //Roll the file once no data has been written to it for this long
          .withInactivityInterval(Duration.ofMinutes(5))
          //Roll the file once it exceeds this size
          .withMaxPartSize(MemorySize.ofMebiBytes(1024))
          .build()
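A usage sketch of attaching the policies to the two modes (paths, the encoder and the bulk writer factory bulkWriterFactory are invented):
// Row mode: any RollingPolicy is allowed, e.g. the DefaultRollingPolicy built above
StreamingFileSink<String> rowSink = StreamingFileSink
        .forRowFormat(new Path("/tmp/out"), new SimpleStringEncoder<String>("UTF-8"))
        .withRollingPolicy(DefaultRollingPolicy.builder().build())
        .build();

// Column (bulk) mode: only a CheckpointRollingPolicy is allowed; OnCheckpointRollingPolicy is the ready-made choice
StreamingFileSink<String> bulkSink = StreamingFileSink
        .forBulkFormat(new Path("/tmp/out"), bulkWriterFactory)
        .withRollingPolicy(OnCheckpointRollingPolicy.build())
        .build();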
4. withOutputFileConfig:
①Explanation: adds a prefix and a suffix to the generated files; shared by row mode and column mode;
.withOutputFileConfig(OutputFileConfig
      .builder()
      .withPartPrefix("gouba-")
      .withPartSuffix(".gz")
      .build())
5. enableCompact:
①Explanation: merges the finished files that have been generated; shared by row mode and column mode;
②Parameters: FileCompactStrategy, FileCompactor
FileCompactStrategy
FileCompactStrategy.Builder.newBuilder()
          //Compact once every this many checkpoints, default 1
          .enableCompactionOnCheckpoint(3)
          //Number of threads used to compact files, default 1
          .setNumCompactThreads(4)
          .build()
FileCompactor
※ IdenticalFileCompactor: copies the content of one file directly into another; only one file can be copied at a time.
※ ConcatFileCompactor: concatenates files; an optional separator to place between two files can be passed in via the constructor.
※ RecordWiseFileCompactor: allows the most customization of the compacted content.
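A sketch of wiring compaction into the newer FileSink (the path and encoder are invented):
FileSink<String> compactingSink = FileSink
        .forRowFormat(new Path("/tmp/out"), new SimpleStringEncoder<String>("UTF-8"))
        .enableCompact(
                FileCompactStrategy.Builder.newBuilder()
                        .enableCompactionOnCheckpoint(3)
                        .setNumCompactThreads(4)
                        .build(),
                new ConcatFileCompactor())
        .build();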
6. withBucketCheckInterval(millis)
Explanation: how often (in milliseconds) the sink checks, based on the RollingPolicy, whether in-progress files should be closed; the check timer starts from second 0 of system time.

Origin blog.csdn.net/m0_64640191/article/details/129859858