Preface
When we start a project, we often need to choose which technologies to use. This decision is usually not made by ordinary developers; the architect selects the appropriate technology according to the customer's requirements. Choosing the right technology makes development more effective. Below I will explain how I made the technology selection for a real-time data warehouse project I built.
1. Technology selection
When we choose a technology, we need to choose according to customer requirements. For example: real-time statistics of transaction amounts, with a required latency of no more than one second. In that case, we cannot use batch processing technologies such as Hive or MapReduce, because a MapReduce job can take more than a second just to start up, so it cannot meet the requirement at all. Instead, we can consider real-time computing technologies such as Flink or Spark Streaming. Next, we will explain how to choose.
There are currently many real-time computing technologies on the market, such as Spark Streaming, Structured Streaming, Storm, JStorm (Alibaba), Kafka Streams, and Flink. Among so many technology stacks, how do we choose?
When we choose a technology, we need to consider it comprehensively. Picking a technology simply because you like it is not a wise choice. In general, we should choose based on factors such as the team's existing technical skills, popularity, technology reuse, and the business scenario. (A comparison chart was originally attached here to analyze which technology should be used; it is for reference only.)
If the latency requirements are not high, you can use Spark Streaming. It has a wealth of high-level APIs, is simple to use, and the Spark ecosystem is relatively mature, with large throughput, simple deployment, and high community activity. The number of GitHub stars also shows that most companies still use Spark, and the new versions introduce Structured Streaming, which makes the Spark system even more complete.
If the latency requirements are very high, you can use Flink, currently the most popular stream processing framework. It is a native stream processing system, which guarantees low latency, and its API and fault tolerance are also relatively complete. It is relatively simple to use and deploy. With Blink, contributed by Alibaba in China, I believe Flink's functionality will become more complete and its development will get even better. The community also responds to issues very quickly, and there are dedicated DingTalk groups and a Chinese mailing list where everyone can ask questions; experts give live explanations and answers every week.
This project uses Flink to build a real-time computing platform.
2. Requirements analysis
The current requirements, ultimately displayed in real time through reports, are:
1. Statistics and analysis of users' daily activity (PV, UV, number of guest visitors), displayed as a histogram
2. Funnel display (number of payments, number of orders, number of shopping-cart additions, number of views)
3. One week's sales, displayed as a curve chart
4. 24-hour sales curve display
5. Proportion of order statuses
6. Analysis of order completion
7. TopN regional ranking
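To make metrics 1 and 2 concrete, here is a minimal Python sketch (an illustration only, not the project's actual Flink job) that computes PV, UV, and funnel-stage counts from a batch of click events. The field names `user_id` and `event` and the sample data are assumptions for the example.

```python
from collections import Counter

# Sample click events; the field names (user_id, event) are assumed for illustration.
events = [
    {"user_id": "u1", "event": "view"},
    {"user_id": "u1", "event": "cart"},
    {"user_id": "u2", "event": "view"},
    {"user_id": "u2", "event": "order"},
    {"user_id": "u1", "event": "pay"},
    {"user_id": "u3", "event": "view"},
]

def daily_activity(events):
    """PV = total page-view events, UV = number of distinct users."""
    pv = len(events)
    uv = len({e["user_id"] for e in events})
    return pv, uv

def funnel(events):
    """Count each stage of the view -> cart -> order -> pay funnel."""
    counts = Counter(e["event"] for e in events)
    return [counts[stage] for stage in ("view", "cart", "order", "pay")]

print(daily_activity(events))  # (6, 3)
print(funnel(events))          # [3, 1, 1, 1]
```

In the real platform the same aggregations would run continuously over the Kafka streams instead of over an in-memory list.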
Data sources

PV/UV data source
- The data comes from page tracking (instrumentation); user access data is sent to the web server
- The web server writes this data directly into the click_log topic of Kafka
Data source of sales amount and order volume
- Order data comes from MySQL
- The order data comes from the MySQL binlog, and is written in real time to the Kafka order topic through Canal
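To illustrate the binlog path, here is a hedged Python sketch that parses a Canal-style JSON change message and extracts the order rows destined for the Kafka order topic. The field layout (`database`, `table`, `type`, `data`) follows Canal's common JSON serialization, but treat it as an assumption rather than the project's exact message schema.

```python
import json

# A Canal-style JSON message as it might appear in Kafka; the exact
# field layout is an assumption based on Canal's common JSON output.
raw = json.dumps({
    "database": "mall",
    "table": "orders",
    "type": "INSERT",
    "data": [{"order_id": "1001", "amount": "59.90", "status": "CREATED"}],
})

def extract_order_rows(message: str):
    """Return the changed order rows from a Canal binlog message, or [] if irrelevant."""
    msg = json.loads(message)
    if msg.get("table") != "orders" or msg.get("type") not in ("INSERT", "UPDATE"):
        return []
    return msg.get("data", [])

print(extract_order_rows(raw))  # [{'order_id': '1001', 'amount': '59.90', 'status': 'CREATED'}]
```

A Flink job consuming the order topic would apply the same kind of filtering and extraction in its deserialization step.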
Shopping cart data and review data
- Shopping cart data generally does not operate on MySQL directly; it is written to Kafka (message queue) through a client program
- Comment data is likewise written into Kafka (message queue) through a client program
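The 24-hour sales curve from the requirements comes down to a tumbling-window sum over order amounts. Here is a language-neutral sketch of that idea in plain Python (not the actual Flink windowing API), assuming each order carries an epoch-second timestamp `ts` and an `amount` field:

```python
from collections import defaultdict

# Sample orders with event-time timestamps (epoch seconds) and amounts;
# the field names (ts, amount) are assumptions for this sketch.
orders = [
    {"ts": 3600 * 0 + 100, "amount": 10.0},
    {"ts": 3600 * 0 + 200, "amount": 5.0},
    {"ts": 3600 * 1 + 50,  "amount": 20.0},
    {"ts": 3600 * 3 + 10,  "amount": 7.5},
]

def hourly_sales(orders):
    """Tumbling one-hour windows: map each order to its hour bucket and sum amounts."""
    buckets = defaultdict(float)
    for o in orders:
        buckets[o["ts"] // 3600] += o["amount"]
    return dict(buckets)

print(hourly_sales(orders))  # {0: 15.0, 1: 20.0, 3: 7.5}
```

In Flink this would be a keyed stream with a one-hour tumbling event-time window and a sum aggregation, emitting one point of the sales curve per window.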
3. Architecture design
According to the requirements analysis, we can design our architecture as follows. Online architecture diagram: https://gitmind.cn/app/flowchart/43aa8334090bdd1e1074271f08328e25
Summary
This article mainly explained how to choose a suitable technology stack, along with the architecture diagram of the real-time data warehouse to be shared later. In an offline data warehouse we use Hive, and we can build the layers inside Hive; for a real-time data warehouse we need a message queue for layering, and this project uses Kafka for layering. I have also prepared some big data resources for you; friends who need them can download them from the GitHub link below. Believe in yourself; hard work and sweat will always be rewarded. I am Big Data Brother, see you next time~~~
Resource acquisition
To obtain Flink interview questions, Spark interview questions, essential software for programmers, Hive interview questions, Hadoop interview questions, Docker interview questions, resume templates, and other resources, download them yourself from GitHub: https://github.com/lhh2002/Framework-Of-BigData or Gitee: https://gitee.com/li_hey_hey/dashboard/projects
Real-time data warehouse code on GitHub: https://github.com/lhh2002/Real_Time_Data_WareHouse