Some considerations on real-time data development based on Doris


Doris has been developing rapidly, as everyone can see. New features such as hot and cold data separation keep being added, which has greatly improved Doris in terms of ease of use and cost.

Real-time data warehouses built on Doris storage are being put into practice in more and more scenarios, and this kind of solution appears frequently in community talks. But we need to look at it objectively: a storage-based real-time data warehouse has both advantages and disadvantages, and in a production environment we must carefully evaluate it against our own business scenarios. In this article, I will briefly discuss the issue based on my personal practice and thinking.

Why is there such a scheme?

In many cases, the choice to do real-time computing on an OLAP engine such as Doris rests on the following considerations.

First, real-time data development based on Flink is in most cases significantly harder than developing offline tasks (the two are not of the same order of magnitude). Developing real-time data on Doris storage significantly lowers the barrier to entry, although it also opens the door to misuse.

Secondly, Flink is not good at large windows, large state, or flexible computing scenarios (note: not good at them, not incapable of them). The cost of handling such scenarios in Flink is high, but Doris can significantly reduce it.

Finally, the observability of data computed in Flink is poor. For example, state data is not visible, there are significant barriers to troubleshooting and debugging, and repairing historical data is very difficult.

So real-time data development based on Flink comes with considerable barriers. This leads to a qualitative conclusion: below a data scale of 100 million (or tens of millions) of records, we can use an analysis engine like Doris to layer the data and run jobs on a timed schedule, just as with offline data, and to process large-window data (generally a time span of more than 30 days). Provided performance is ensured, this reduces the development cost of real-time data, greatly improves data observability, and improves development and operations efficiency to some extent.
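As a rough illustration of the scheduled, large-window processing described above, the following Python sketch builds the SQL for one run of a periodic rollup over a trailing 30-day window. The table names (`dwd_orders`, `dws_user_order_30d`) and column names are hypothetical and chosen only for this example. How the statement is submitted to Doris and which scheduler triggers it are left out, since the article does not specify them.

```python
from datetime import datetime, timedelta

# Hypothetical table names, used only for illustration.
SOURCE_TABLE = "dwd_orders"          # detail layer
TARGET_TABLE = "dws_user_order_30d"  # aggregated layer
WINDOW_DAYS = 30                     # "large window" span mentioned in the article


def build_rollup_sql(run_time: datetime, window_days: int = WINDOW_DAYS) -> str:
    """Build the SQL for one scheduled run over a trailing time window."""
    window_start = run_time - timedelta(days=window_days)
    fmt = "%Y-%m-%d %H:%M:%S"
    # Column names (user_id, amount, event_time) are assumptions.
    return (
        f"INSERT INTO {TARGET_TABLE} (user_id, order_cnt, order_amount)\n"
        f"SELECT user_id, COUNT(*), SUM(amount)\n"
        f"FROM {SOURCE_TABLE}\n"
        f"WHERE event_time >= '{window_start.strftime(fmt)}'\n"
        f"  AND event_time < '{run_time.strftime(fmt)}'\n"
        f"GROUP BY user_id"
    )


if __name__ == "__main__":
    # A scheduler would call this once per cycle with the current time.
    print(build_rollup_sql(datetime(2023, 8, 31, 0, 0, 0)))
```

Because every layer is an ordinary table, the intermediate result of each run can be inspected with a plain query, which is where the observability benefit comes from.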

Advantages compared with Flink-based solutions

  1. Low threshold, simple development

Anyone can develop such tasks;

  2. Simple operation and maintenance

There is no need to consider state compatibility as with Flink, and no need to hold a large amount of resources for long periods. Resources only need to be scheduled while the SQL is running;

  3. Higher development efficiency

There is no need for a deep understanding of Flink (which, of course, is not necessarily a good thing), there are almost no parameters to tune, testing is simple, and there is no need to start up scheduling containers (such as the scheduling of TaskManagers and Tasks);

  4. Convenient data debugging, with intermediate results persisted and visible

There is no hidden state data as in Flink; all data is available in tables.

The points above are advantages, but the Doris-based solution also has obvious shortcomings that require special attention!

  1. Obvious delay

If you use Doris, you will most likely pair it with timed scheduling. The scheduling cycle is generally 30 seconds or more, which means the freshness of the data is greatly reduced, and the approach is not suitable for indicators that must be observed in real time, such as real-time GMV or the number of online users;

  2. Data size limit

If you adopt Doris, your TPS cannot be too high. High-TPS workloads are not what Doris is good at, and this needs special attention. In addition, the amount of data in a single scan cannot be too large. As mentioned earlier, performance is well guaranteed only when the data size is below 100 million (or tens of millions) of records.
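The two limits above can be written down as a simple pre-check. The sketch below is only one way to encode the article's rules of thumb. The `Workload` class and the function name are hypothetical, and the thresholds (30 seconds, 100 million records) are taken from the text as starting points, not as hard limits of Doris.

```python
from dataclasses import dataclass
from typing import List

# Rule-of-thumb values from the article; adjust for your own cluster.
MIN_SCHEDULE_INTERVAL_S = 30         # scheduling cycles are generally 30 s or more
MAX_RECORDS_PER_SCAN = 100_000_000   # "below 100 million (or tens of millions)"


@dataclass
class Workload:
    """Hypothetical description of a candidate real-time task."""
    name: str
    max_acceptable_delay_s: float  # how stale the result is allowed to be
    records_per_scan: int          # records read by a single scheduled run


def scheduling_concerns(w: Workload) -> List[str]:
    """Return the reasons a workload may be a poor fit for scheduled Doris SQL."""
    concerns = []
    if w.max_acceptable_delay_s < MIN_SCHEDULE_INTERVAL_S:
        concerns.append(
            f"{w.name}: needs data fresher than the "
            f"{MIN_SCHEDULE_INTERVAL_S}s scheduling cycle"
        )
    if w.records_per_scan >= MAX_RECORDS_PER_SCAN:
        concerns.append(
            f"{w.name}: a single scan reads {w.records_per_scan} records, "
            f"at or above the {MAX_RECORDS_PER_SCAN} rule of thumb"
        )
    return concerns


if __name__ == "__main__":
    for workload in (
        Workload("realtime_gmv", max_acceptable_delay_s=1, records_per_scan=5_000_000),
        Workload("retention_30d", max_acceptable_delay_s=300, records_per_scan=20_000_000),
    ):
        print(workload.name, scheduling_concerns(workload) or "no concerns")
```

A check like this does not replace benchmarking. It only makes the evaluation explicit before a task is moved onto Doris.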

Finally, if you do choose Doris-based real-time data development, Doris will become the center of your costs and of your operations and maintenance. It must be backed by rigorous supporting tooling, such as alerting, task run monitoring, task standardization, scheduling, and data lineage capabilities. Pay special attention to resource usage and SQL performance: once they become bottlenecks, they affect every Doris-based task.
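As one small example of the task run monitoring mentioned above, the sketch below flags scheduled tasks whose run time is approaching their scheduling interval, since such tasks are likely to pile up and compete for shared Doris resources. The `TaskRun` record and the 80% threshold are assumptions made for illustration; real run records would come from whatever scheduler is in use.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class TaskRun:
    """Hypothetical record of the latest run of a scheduled SQL task."""
    task: str
    interval_s: float  # how often the task is scheduled
    duration_s: float  # how long the latest run took


def find_overrunning(runs: Iterable[TaskRun], threshold: float = 0.8) -> List[str]:
    """Return tasks whose run time is at least `threshold` of their interval."""
    return [r.task for r in runs if r.duration_s >= threshold * r.interval_s]


if __name__ == "__main__":
    latest_runs = [
        TaskRun("dws_user_order_30d", interval_s=60, duration_s=10),
        TaskRun("dws_item_sales_30d", interval_s=60, duration_s=50),
    ]
    for name in find_overrunning(latest_runs):
        print(f"ALERT: {name} is close to overrunning its schedule")
```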

Origin blog.csdn.net/u013411339/article/details/132157790