Real-time writes to Doris with Flink

Introduction

Anyone building real-time data warehouses is familiar with the currently popular KFC stack (Kafka / Flink / ClickHouse). In fact, KFD (Kafka / Flink / Doris) works well too.
Big-data components keep multiplying, but there is still no single tool that covers both OLAP and OLTP, that is, one that satisfies real-time storage and complex queries over DB data and logs, and also supports building a data warehouse on top. We tried ClickHouse: it is hard to maintain, its real-time write efficiency is low, and handling sharding and data migration ourselves made real-time storage of large data volumes difficult. We then tried Impala + Kudu: Impala is too memory-hungry, and operating the two together is laborious. We also developed our own tool for real-time DB synchronization, but the maintenance cost was too high, so we abandoned it. Finally, drawing on material from Baidu's Doris and Zuoyebang (作业帮), we officially adopted Doris and achieved quasi-real-time synchronization of logs and DB data (including merging sharded tables), plus data warehouse modeling on top of Doris.
Next I will briefly describe how the real-time write path into Doris is implemented, mainly through code and comments.

Table design

Do not declare fields as NOT NULL. The benefit is that altering the table later (adding columns) will not affect existing data. Other design details are not convenient to disclose for now; I will cover them later.

JSON Stream Load

Why choose Stream Load? At the beginning we used INSERT INTO, but it consumes FE resources, keeping the FE busy, and it would run into problems as data volume grew. Stream Load does not have this issue. As the official documentation (0.12) puts it:

In Stream Load, Doris selects one node to act as the Coordinator node. That node is responsible for receiving the data and distributing it to the other data nodes.

The user submits the import command via HTTP. If it is submitted to an FE, the FE forwards the request to a BE through an HTTP redirect. The user can also submit the import command directly to a specific BE.

The final result of the import is returned to the user by the Coordinator BE.

                         ^      +
                         |      |
                         |      | 1A. User submit load to FE
                         |      |
                         |   +--v-----------+
                         |   | FE           |
 - Return result to user |   +--+-----------+
                         |      |
                         |      | 2. Redirect to BE
                         |      |
                         |   +--v-----------+
                         +---+Coordinator BE| 1B. User submit load to BE
                             +-+-----+----+-+
                               |     |    |
                         +-----+     |    +-----+
                         |           |          | 3. Distribute data
                         |           |          |
                       +-v-+       +-v-+      +-v-+
                       |BE |       |BE |      |BE |
                       +---+       +---+      +---+

After that, following Jingdong's practice, we implemented continuous loading of small files to achieve real-time data insertion.
Pitfalls encountered:

  • Batch as much data as possible per load to avoid frequent submissions, which otherwise cause thread-occupancy problems
  • Load is scoped per database; a DB defaults to 100 threads, so control the number of load threads
  • Load is very memory-intensive: one cost is the threads, the other is data merging
  • streaming_load_max_batch_size_mb defaults to 100 and can be changed to fit the business
  • If you want to synchronize DB data, pay attention to multi-threaded execution of curl (see the sketch after this list)
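
On that last point, here is a minimal sketch of running curl for several tables in parallel through a fixed-size thread pool. The table list and pool size are hypothetical, and CurlUtils refers to the helper methods shown in the next section:

    import java.util.concurrent.{Executors, TimeUnit}

    // Hypothetical (database, table, tempFilePath) triples to sync in parallel;
    // CurlUtils is the helper class defined below.
    val tables = Seq(
      ("db1", "orders_0", "/tmp/doris/db1/orders_0/0"),
      ("db1", "orders_1", "/tmp/doris/db1/orders_1/0"))
    // Keep the pool well below the per-DB load-thread limit (100 by default)
    val pool = Executors.newFixedThreadPool(4)

    tables.foreach { case (db, table, path) =>
      pool.submit(new Runnable {
        override def run(): Unit = {
          // Each curl call is a separate Stream Load; the response tells whether it succeeded
          println(CurlUtils.execCurl(CurlUtils.createCurl(path, db, table)))
        }
      })
    }
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.MINUTES)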

The implementation is relatively simple: essentially, embed a snippet in the Flink sink code that executes curl.

## Original curl
curl --location-trusted -u user:password -T /xxx/test -H "format: json" -H "strip_outer_array: true" http://doris_fe:8030/api/{database}/{table}/_stream_load
## -u    username and password
## -T    path of the JSON file; its content is [json,json,json], i.e. a JSON list
## -H    header parameters
## http  the URL carries the database name and table name

Steps: create a temporary file (createFile), write the data into it (mappedFile), execute curl (execCurl), then delete the temporary file (deleteFile). Simplified version:


    import java.io.BufferedReader;
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    /**
     * Create the temporary file (including parent directories).
     * @param fileName
     * @throws IOException
     */
    public static void createFile(String fileName) throws IOException {
        File testFile = new File(fileName);
        File fileParent = testFile.getParentFile();

        if (!fileParent.exists()) {
            fileParent.mkdirs();
        }
        if (!testFile.exists()) {
            testFile.createNewFile();
        }
    }

    /**
     * Delete the temporary file.
     * @param fileName
     * @return true if the file existed and was deleted
     */
    public static boolean deleteFile(String fileName) {
        boolean flag = false;
        File file = new File(fileName);
        // Only delete when the path is an existing regular file
        if (file.isFile() && file.exists()) {
            file.delete();
            flag = true;
        }
        return flag;
    }

    /**
     * Write data into the file through a memory-mapped buffer.
     * @param data
     * @param path
     */
    public static void mappedFile(String data, String path) {
        CharBuffer charBuffer = CharBuffer.wrap(data);
        try (FileChannel fileChannel = FileChannel.open(Paths.get(path),
                StandardOpenOption.READ, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            // Encode first so the mapped region matches the UTF-8 byte length exactly
            ByteBuffer encoded = StandardCharsets.UTF_8.encode(charBuffer);
            MappedByteBuffer mappedByteBuffer =
                    fileChannel.map(FileChannel.MapMode.READ_WRITE, 0, encoded.remaining());
            mappedByteBuffer.put(encoded);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Execute the curl command and collect its output.
     * @param curl
     * @return the command output, or null on failure
     */
    public static String execCurl(String[] curl) {
        ProcessBuilder process = new ProcessBuilder(curl);
        Process p;
        try {
            p = process.start();
            BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()));
            StringBuilder builder = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                builder.append(line);
                builder.append(System.getProperty("line.separator"));
            }
            return builder.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * Build the curl command for Stream Load.
     * @param filePath
     * @param databases
     * @param table
     * @return
     */
    public static String[] createCurl(String filePath, String databases, String table) {
        String[] curl = {"curl", "--location-trusted", "-u", "user:password",
                "-T", filePath, "-H", "format: json", "-H", "strip_outer_array: true",
                "http://doris_fe:8030/api/" + databases + "/" + table + "/_stream_load"};
        return curl;
    }
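
Putting the helpers together, a minimal usage sketch of one load cycle, assuming the methods above live in a CurlUtils class (the sink below makes the same assumption); the database, table, and file path are placeholders:

    // One load cycle: create the temp file, write the JSON list, stream-load it, clean up
    val path = "/tmp/doris/db1/t1/0" // placeholder temp-file path
    val jsonList = """[{"id":1,"name":"a"},{"id":2,"name":"b"}]""" // [json,json,...], as -T expects

    CurlUtils.createFile(path)
    CurlUtils.mappedFile(jsonList, path)
    val response = CurlUtils.execCurl(CurlUtils.createCurl(path, "db1", "t1"))
    println(response) // Stream Load returns a JSON result; check that Status is Success
    CurlUtils.deleteFile(path)

Shelling out to curl keeps the code simple, at the cost of forking one process per batch; batching aggressively (as noted in the pitfalls above) keeps that overhead acceptable.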

Custom sink

It is relatively simple to implement a custom sink; here is a brief look at how I wrote it (simplified version).

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}
import org.slf4j.LoggerFactory

class LogCurlSink(insertTimeInterval: Long,
                  insertBatchSize: Int) extends RichSinkFunction[(String, Int, Long, String)] with Serializable {

  private val Logger = LoggerFactory.getLogger(this.getClass)
  private val mesList = new java.util.ArrayList[String]()
  private var lastInsertTime = 0L
  // {databases}/{table}/{ThreadId} are placeholders in this simplified version
  private val path = "/tmp/doris/{databases}/{table}/{ThreadId}"

  override def open(parameters: Configuration): Unit = {
    CurlUtils.createFile(path)
    Logger.warn(s"init and create $path!!!")
  }

  // value is (topic, partition, offset, jsonStr)
  override def invoke(value: (String, Int, Long, String), context: SinkFunction.Context[_]): Unit = {
    if (mesList.size >= this.insertBatchSize || isTimeToDoInsert) {
      // flush the current batch
      insertData(mesList)
      // offsets could be committed here
      mesList.clear()
      this.lastInsertTime = System.currentTimeMillis()
    }
    mesList.add(value._4)
  }

  override def close(): Unit = {
    // flush anything still buffered before removing the temp file
    insertData(mesList)
    CurlUtils.deleteFile(path)
    Logger.warn("close and delete filePath!!!")
  }

  /**
    * Perform the insert: write dataList to the temp file and trigger Stream Load
    * (body omitted in this simplified version)
    * @param dataList
    */
  private def insertData(dataList: java.util.ArrayList[String]): Unit = {
  }

  /**
    * Decide by elapsed time whether the batch should be flushed
    * @return
    */
  private def isTimeToDoInsert = {
    val currTime = System.currentTimeMillis
    currTime - this.lastInsertTime >= this.insertTimeInterval
  }

}
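
To wire the sink into a job, a minimal runnable sketch; env.fromElements stands in for a real Kafka source emitting (topic, partition, offset, jsonStr) tuples, and the thresholds are illustrative:

    import org.apache.flink.streaming.api.scala._

    object LogCurlSinkDemo {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        // Stand-in for a Kafka source emitting (topic, partition, offset, jsonStr)
        val stream = env.fromElements(
          ("log_topic", 0, 0L, """{"id":1,"name":"a"}"""),
          ("log_topic", 0, 1L, """{"id":2,"name":"b"}"""))
        // Flush every 1000 records or every 5 seconds, whichever comes first
        stream.addSink(new LogCurlSink(insertTimeInterval = 5000L, insertBatchSize = 1000))
        env.execute("doris-stream-load-demo")
      }
    }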

Origin blog.csdn.net/jklcl/article/details/112851685