Real-Time Data Warehouse: Architecture Design and Technology Selection


Preface

When working on a project, we often need to decide which technology to use. This decision is usually not made by ordinary employees; rather, the architect chooses the appropriate technology according to the customer's needs. Choosing the right technology makes development much more effective. Below I will explain how technology selection was done for a real-time data warehouse project I built.

1. Technology Selection

When choosing a technology, we need to base the decision on customer requirements. For example: real-time statistics of transaction amounts, with a required latency of no more than one second. In this case we cannot use batch-processing technologies such as Hive or MapReduce, because merely starting a MapReduce job can take more than a second, so those technologies cannot meet the requirement at all. Instead we should consider real-time computing technologies such as Flink or Spark Streaming. Next, we will explain how to choose among them.

The market currently offers many real-time computing technologies, such as Spark Streaming, Structured Streaming, Storm, JStorm (Alibaba), Kafka Streams, and Flink. With so many options in the technology stack, how do we choose?

When we choose a technology, we need to consider it holistically. It is not wise to pick a technology simply because you happen to like it. In general, the choice should weigh factors such as the team's existing technical skills, how widely the technology is adopted, how reusable it is across projects, and the specific business scenario. (A technology comparison chart accompanied the original article; the recommendation below is for reference only.)

If the latency requirements are not strict, you can use Spark Streaming. It has a rich set of high-level APIs, is simple to use, and the Spark ecosystem is relatively mature, with high throughput, simple deployment, and an active community. The number of GitHub stars also shows that most companies still use Spark, and newer versions introduce Structured Streaming, which makes the Spark ecosystem even more complete.

If the latency requirements are very strict, you can use Flink, currently the most popular stream-processing framework. As a native stream-processing system it guarantees low latency, its APIs and fault-tolerance mechanisms are relatively complete, and it is relatively easy to use and deploy. With Blink, contributed by Alibaba in China, I believe Flink's functionality will become even more complete and its development even better in the future. The community responds to issues very quickly, and there are dedicated DingTalk groups and a Chinese mailing list where anyone can ask questions; experts give live explanations and answers every week.

This project: Use Flink to build a real-time computing platform

2. Demand Analysis

The current requirements are ultimately to be displayed in real time through reports:

1. Statistical analysis of users' daily activity (PV, UV, number of guests), displayed as a histogram

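To make item 1 concrete, here is a minimal Python sketch (not from the original article) of how PV, UV, and guest counts could be computed from click-log events. The event schema, with `user_id` set to `None` for anonymous visitors, is an assumption for illustration; in the actual project this computation would run in Flink over the `click_log` topic.

```python
def daily_active_stats(events):
    """PV = total page views, UV = distinct logged-in users,
    guests = views by anonymous visitors (user_id is None).
    The event schema is assumed for illustration."""
    pv = len(events)
    uv = len({e["user_id"] for e in events if e["user_id"] is not None})
    guests = sum(1 for e in events if e["user_id"] is None)
    return {"pv": pv, "uv": uv, "guests": guests}
```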

2. Funnel display (number of payments, number of orders, number of shopping-cart additions, number of views)

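For item 2, a funnel chart is driven by step-by-step conversion rates between adjacent stages. A hedged sketch (the stage ordering and counts are illustrative, not from the article):

```python
def funnel_conversion(stage_counts):
    """Given stage counts ordered from widest to narrowest,
    e.g. [views, cart_adds, orders, payments], return the
    conversion rate between each pair of adjacent stages."""
    return [
        (cur / prev) if prev else 0.0
        for prev, cur in zip(stage_counts, stage_counts[1:])
    ]
```

For example, 1000 views, 400 cart additions, 200 orders, and 150 payments yield conversion rates of 40%, 50%, and 75% between the stages.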

3. Calculate one week's sales, displayed as a line chart

4. 24-hour sales curve display

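Item 4 (and item 3) boil down to windowed aggregation: summing order amounts into fixed time buckets. Here is a minimal Python sketch of a 1-hour tumbling-window sum; the `(timestamp, amount)` input shape is an assumption, and in the real project Flink's window operators would do this over the order stream.

```python
from collections import defaultdict
from datetime import datetime

def hourly_sales(orders):
    """Aggregate order amounts into 1-hour tumbling windows.
    orders: iterable of (iso_timestamp, amount); schema assumed."""
    buckets = defaultdict(float)
    for ts, amount in orders:
        # Truncate the event time to the start of its hour.
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        buckets[hour.isoformat()] += amount
    return dict(buckets)
```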

5. Proportion of each order status

6. Analysis of order completion status


7. TopN Regional Ranking

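Item 7, the TopN regional ranking, is a keyed aggregation followed by a top-N selection. A minimal Python sketch (the `(region, amount)` input shape is assumed; Flink's keyed windows plus a TopN function would play this role in the real pipeline):

```python
import heapq
from collections import defaultdict

def top_n_regions(order_amounts, n=3):
    """Rank regions by total sales amount and keep the top n.
    order_amounts: iterable of (region, amount); schema assumed."""
    totals = defaultdict(float)
    for region, amount in order_amounts:
        totals[region] += amount
    # nlargest keeps only n entries, avoiding a full sort of all regions.
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])
```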

Data sources

PV/UV data source

  • Data comes from the page's embedded tracking points, which send user access data to the web server
  • The web server writes this data directly into the click_log topic of Kafka

Data source of sales amount and order volume

  • Order data comes from MySQL
  • The order data is read from the MySQL binlog and written in real time to the Kafka order topic through Canal
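As an illustration of the binlog path, here is a hedged sketch of parsing a Canal-style flat JSON message on the consumer side. The field names (`type`, `table`, `data`) follow Canal's flat-message format as I understand it, and the simplification (insert-only, single table) is mine, not the article's.

```python
import json

def extract_order_rows(raw_message):
    """Pull inserted order rows out of a Canal-style flat JSON
    message (simplified; field names assumed from Canal's format)."""
    msg = json.loads(raw_message)
    # Only keep INSERTs on the orders table for this illustration.
    if msg.get("type") != "INSERT" or msg.get("table") != "orders":
        return []
    return msg.get("data", [])
```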

Shopping cart data and review data

  • Shopping cart data generally does not come from direct MySQL operations; it is written to Kafka (the message queue) by the client program
  • Review data is likewise written to Kafka (the message queue) by the client program

3. Architecture Design

Based on the requirements analysis above, we can design our architecture accordingly. Online architecture diagram: https://gitmind.cn/app/flowchart/43aa8334090bdd1e1074271f08328e25

Summary

This article mainly explained how to choose a suitable technology stack, along with the architecture diagram of the real-time data warehouse that will be shared in later articles. In an offline data warehouse we use Hive, and the layering can be built inside Hive; for a real-time data warehouse we need a message queue for layering, and this project uses Kafka. I have also collected some big data resources; friends who need them can download them from the GitHub links below. Believe in yourself; hard work and sweat will always be rewarded. I am Big Data Brother, see you next time!
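As a concrete sketch of the Kafka-based layering mentioned above: the layer names below (ods/dwd/dws) follow a common data-warehouse convention, not something specified in this article, and the topic-naming helper is purely illustrative.

```python
# Common warehouse layers mapped onto Kafka topics (naming convention assumed):
#   ods - raw events as produced (e.g. the click_log topic)
#   dwd - cleaned, deduplicated detail records
#   dws - lightly aggregated summaries (e.g. PV/UV per window)
def layer_topic(layer, subject):
    """Build a layered Kafka topic name, e.g. ods_click_log."""
    return f"{layer}_{subject}"
```

Each Flink job then reads from one layer's topic and writes its result to the next layer's topic, mirroring the way Hive layers are chained in an offline warehouse.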

Resource acquisition: Flink interview questions, Spark interview questions, essential software for programmers, Hive interview questions, Hadoop interview questions, Docker interview questions, resume templates, and other resources can be downloaded from GitHub: https://github.com/lhh2002/Framework-Of-BigData or from Gitee: https://gitee.com/li_hey_hey/dashboard/projects. Real-time data warehouse code on GitHub: https://github.com/lhh2002/Real_Time_Data_WareHouse



Origin blog.51cto.com/14417862/2593762