Music data center platform based on big data (attachments: source code, courseware, project deployment documentation)

Project Introduction

The music data center data warehouse project analyzes the data the company has accumulated, such as users' song requests and music purchases (covering both business data and user behavior data), and provides decision support (BI business decision-making) for the healthier development of the company's business.


The data center project includes business system data and user behavior log data.

Business data refers to the data generated by the business system, such as orders, logins, song requests, advertisement impressions, and other data produced inside the system.

User behavior data refers to the actions users perform on the physical machines, such as clicks, favorites, QR-code scans, and other events.

The results of the company's analysis of the above data mainly have two applications:

One is the BI system. Business intelligence mainly produces reports for the company's operations staff to consult, for example: daily song-request volume, daily revenue, machine distribution, real-time PV/UV, user retention rate, funnel models, and so on.

The other application is data services. Data services expose the analyzed result data to business systems through interfaces. For example, a recommendation system may recommend songs based on songs, based on singers, or based on users.

Project Structure

The data center project is a comprehensive Spark-based data warehouse project, divided into offline processing and real-time processing. The technologies used include MySQL, Sqoop, HDFS, YARN, Hive, data warehouse model design, SparkCore, SparkSQL, SparkStreaming, Azkaban, Flume, Kafka, Redis, Superset, WeChat Index, the Amap (Gaode) API, etc.

Offline processing: mainly Spark. Very little SparkCore code is used; SparkSQL is the main tool for building the data warehouse. The project uses Airflow/Azkaban for scheduling; jobs can be scheduled daily or monthly, and the daily schedule is triggered at a fixed time each day.

Real-time processing: SparkStreaming is used for real-time processing. The offline N+1 (next-day) approach cannot provide real-time data, and the operations center sometimes needs real-time information about online users. The client is instrumented with event tracking: every user action on the client is an event, and when certain events are detected, the system judges whether the user matches the profile of a target user.

The real-time processing flow is as follows:

HTTP request --> data collection interface --> data directory --> Flume monitoring the directory (files in the monitored directory are split by date) --> Kafka (the data is also written to HDFS, feeding the offline processing described above) --> SparkStreaming analyzes the data --> Kafka (consumed by the operations center)
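A minimal sketch of the SparkStreaming leg of this flow, assuming illustrative topic names, broker addresses, and consumer group (none of these values are taken from the project):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object ClientLogStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ClientLogStreaming")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Consumer settings; brokers, group id and topic are placeholders
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "node1:9092,node2:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "client_log_group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Read the raw client events that Flume pushed into the source topic
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("client_log"), kafkaParams)
    )

    // Analyze each batch; in the project the derived events would be produced
    // to a downstream Kafka topic for the operations center
    stream.map(_.value()).foreachRDD { rdd =>
      rdd.foreachPartition { part =>
        part.foreach(event => println(event)) // placeholder for event parsing and target-user checks
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```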


Data source and collection

The data sources of the data center project are mainly the data generated by KTV karaoke machines in shopping malls and by dancing machines in shopping malls.

The data generated by these two kinds of devices falls into two categories. One is order data, which is recorded in the business database; later, the data in MySQL is extracted directly to HDFS through Sqoop. The other is data uploaded via HTTP requests to a log server dedicated to data collection. Every day the operations staff package this data and upload it to a directory on the data center platform, and a scheduled task then runs a Spark job to pull the data and load it into HDFS. Here the compressed data is read and processed with SparkCore, and after processing it can be stored in HDFS in Parquet or JSON format.
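As a rough illustration of this cleaning step (a sketch only; it assumes the uploaded archive has already been unpacked into plain or .gz text files under a per-day directory, and that each line is a tab-separated record of machine id, event name, and a JSON payload; the real layout is defined in the "Event Reporting Protocol.docx" document):

```scala
import org.apache.spark.sql.SparkSession

object ProduceClientLog {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ProduceClientLog")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Read the day's client logs; Spark decompresses .gz text files transparently
    val raw = spark.sparkContext.textFile("hdfs://mycluster/logdata/20220101/*")

    // Keep only well-formed lines and split them into (machine id, event, JSON payload)
    val events = raw.map(_.split("\t"))
      .filter(_.length >= 3)
      .map(a => (a(0), a(1), a(2)))
      .toDF("mid", "event", "json")

    // Land the cleaned data in the ODS layer in Parquet format, one directory per day
    events.write.mode("overwrite")
      .parquet("hdfs://mycluster/warehouse/ods/to_client_log/dt=20220101")
  }
}
```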

Data Warehouse Model

The data warehouse is divided into three themes: users, machines, and content (song-related and singer-related). Each theme has its corresponding tables. The warehouse design is split into three layers, as follows:

ODS layer:

The ODS layer holds the raw data of the tables extracted from the business database. The data is imported from the relational database MySQL, converted into Parquet files, and stored in HDFS, which makes later processing with SparkSQL convenient.

The sources of ODS layer data are as follows:

External data source: song popularity data and singer popularity data crawled from NetEase Cloud Music; the crawled data is in JSON format.

Internal data sources: mainly MySQL and the JSON data uploaded by the clients. MySQL data is extracted to HDFS with Sqoop and loaded into the ODS layer. The clients write logs to the client log server, where the operations staff compress and package the data every day and import it into the HDFS path, i.e., the ODS layer.
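As a rough illustration, an ODS table over the landed Parquet files might be registered like this (the database, table, and column names here are hypothetical; the project's actual table definitions are in the "data warehouse model.xlsx" file):

```scala
import org.apache.spark.sql.SparkSession

object CreateOdsTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CreateOdsTable")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical ODS table stored as Parquet in the Hive warehouse
    spark.sql(
      """
        |CREATE TABLE IF NOT EXISTS music.TO_SONG_INFO_D (
        |  source_id STRING,
        |  name      STRING,
        |  singer    STRING,
        |  album     STRING
        |)
        |STORED AS PARQUET
      """.stripMargin)
  }
}
```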

EDS layer:

The EDS layer is responsible for information integration and lightly summarized data. Put simply, it reorganizes transactional data into dimensionally modeled data that is easy to analyze and performs some light aggregation, similar to wide tables in Hive. For example, when cleaning ODS layer data, if the theme is the user theme, the data is organized together at the granularity of the user ID; if the theme is the machine theme, the data is organized together at the granularity of the machine ID.

Both the ODS and EDS layers are processed with Spark code: SparkSQL reads the ODS layer data and saves the results to the EDS layer in Hive.
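A minimal sketch of this ODS-to-EDS step, with hypothetical table and column names standing in for the project's real ones:

```scala
import org.apache.spark.sql.SparkSession

object OdsToEds {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("OdsToEds")
      .enableHiveSupport()
      .getOrCreate()

    // Light aggregation at user-id granularity: read ODS play logs, summarize,
    // and write into an EDS wide table. Table and column names are illustrative only.
    spark.sql(
      """
        |INSERT OVERWRITE TABLE music.TW_USER_PLAY_D PARTITION (dt = '20220101')
        |SELECT uid,
        |       COUNT(*)           AS play_cnt,
        |       SUM(play_duration) AS play_duration
        |FROM music.TO_CLIENT_SONG_PLAY_D
        |WHERE dt = '20220101'
        |GROUP BY uid
      """.stripMargin)
  }
}
```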

DM layer:

Part of the DM layer data is stored in Hive tables, and part of the analysis results is saved to MySQL, HBase, etc. The EDS layer data is in Parquet format; the main reason for keeping DM data in Hive is to query some business metrics with Kylin later. The data in MySQL is all result data. The reason for using HBase is to support detailed queries against large tables.
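For the results written to MySQL, a common pattern is a JDBC write from SparkSQL; the sketch below is illustrative only (the connection details, database, and table names are assumptions):

```scala
import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

object EdsToDm {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("EdsToDm")
      .enableHiveSupport()
      .getOrCreate()

    // Read a result table from the warehouse in Hive (name is hypothetical)
    val result = spark.sql("SELECT * FROM music.TM_SONG_RSI_D WHERE dt = '20220101'")

    // JDBC connection properties for the result database (values are placeholders)
    val props = new Properties()
    props.setProperty("user", "root")
    props.setProperty("password", "******")
    props.setProperty("driver", "com.mysql.jdbc.Driver")

    // Overwrite the corresponding DM result table in MySQL so BI tools can read it
    result.write.mode(SaveMode.Overwrite)
      .jdbc("jdbc:mysql://node1:3306/song_result", "tm_song_rsi_d", props)
  }
}
```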

The table designs of the above data warehouse model are laid out, layer by layer, in the "data warehouse model.xlsx" file.

The first business: song popularity and singer popularity ranking

Requirements


The requirement is to use users' song-request behavior on each jukebox to count, for each song, the number of requests, the number of likes, the number of distinct users who requested it, and the number of orders, as well as the highest daily request count over the past 7 and 30 days, the highest daily like count over the past 7 and 30 days, and the popularity of songs and singers in each period.
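A rough sketch of how the 7-day and 30-day peaks might be computed with SparkSQL (the table name, columns, and yyyyMMdd dt partition format are assumptions, not the project's actual schema; the like counts follow the same pattern):

```scala
import org.apache.spark.sql.SparkSession

object SongRequestPeaks {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SongRequestPeaks")
      .enableHiveSupport()
      .getOrCreate()

    // Highest single-day request count per song over the last 7 and last 30 days
    spark.sql(
      """
        |SELECT song_id,
        |       MAX(CASE WHEN dt >= date_format(date_sub(current_date(), 7), 'yyyyMMdd')
        |                THEN day_play_cnt END) AS top_play_cnt_7d,
        |       MAX(day_play_cnt)               AS top_play_cnt_30d
        |FROM music.TW_SONG_PLAY_D
        |WHERE dt >= date_format(date_sub(current_date(), 30), 'yyyyMMdd')
        |GROUP BY song_id
      """.stripMargin).show(10)
  }
}
```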

Model Design

To complete the analysis of yesterday's song popularity and singer popularity, the following two types of data are needed:

Basic information of songs and artists

This information resides in the song table of the business system's relational database MySQL. It is extracted with full overwrite to the ODS layer of the Hive data warehouse through Sqoop every day. For the structure of the song table, refer to the "data warehouse model.xlsx" file and the song table data in MySQL.

Users' song-request behavior data on the machines

This part of the data is the users' song-request and play behavior on each machine for the day. It is packaged and uploaded to the HDFS platform as a gz-compressed file at midnight every day. Here we assume the file "currentday_clientlog.tar.gz" is uploaded to the HDFS path "hdfs://mycluster/logdata" at a fixed time every morning. In an enterprise setting it should be uploaded into a directory structure named by day. The data is then cleaned with Spark and stored in the ODS layer of the Hive data warehouse. For the format of the users' song-request behavior data on the machines, refer to the "Event Reporting Protocol.docx" document.

Here, according to the requirements, the tables are divided into the following three layers:


The second business: machine detailed information statistics

Requirements

At present, the basic detailed information of the machines needs to be computed from data in two business systems. The relational databases corresponding to these two business systems are "ycak" and "ycbk".

There are two machine-related tables in the "ycak" database, as follows:

"machine_baseinfo" machine basic information table, the machine's system version, song library version, UI version, and the latest login time are related.

"machine_local_info" daily full list of machine location information, provinces, cities and counties where the machine is located and detailed address, operating time and sales time are related.

There are 6 tables in the "ycbk" database, as follows:

"machine_admin_map" machine client mapping data table

"machine_store_map" machine store mapping relationship table

"machine_store_info" full store information table

"province_info" machine province daily full scale

"city_info" Machine City Daily Full Scale

"area_info" Machine Area and County Daily Full Scale

Note: All machine information comes from the "machine_baseinfo" machine basic information table and the "machine_admin_map" machine-to-customer mapping table.

Model Design

To produce the machine detail statistics above, the data, which sits in the two business system databases, needs to be extracted from the relational databases into the Hive ODS layer through Sqoop.

According to the requirements, the analysis is about machines, so we build the "machine" theme in the data warehouse. The data layering is as follows:


Create the tables corresponding to the ODS layer in Hive:

In addition to the ODS layer table structures, the EDS layer's TW_MAC_BASEINFO_D table also needs a corresponding table at the DM layer; here the DM layer has a corresponding tm_mac_baseinfo_d table in MySQL. The data flow between these tables is as follows:
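A rough sketch of building this EDS wide table from the Sqoop-imported ODS tables (TO_YCAK_MAC_D is named later in this document; the other ODS table names and the columns are assumptions):

```scala
import org.apache.spark.sql.SparkSession

object GenerateTwMacBaseinfoD {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("GenerateTwMacBaseinfoD")
      .enableHiveSupport()
      .getOrCreate()

    // Join the machine base info, location info, and customer mapping imported from
    // ycak/ycbk into the EDS-layer wide table TW_MAC_BASEINFO_D; columns are illustrative.
    spark.sql(
      """
        |INSERT OVERWRITE TABLE music.TW_MAC_BASEINFO_D PARTITION (dt = '20220101')
        |SELECT b.mid, b.sys_ver, b.song_ver, b.ui_ver,
        |       l.prvc, l.cty, l.area, l.addr,
        |       m.cust_id
        |FROM music.TO_YCAK_MAC_D b
        |LEFT JOIN music.TO_YCAK_MAC_LOC_D       l ON b.mid = l.mid
        |LEFT JOIN music.TO_YCBK_MAC_ADMIN_MAP_D m ON b.mid = m.mid
      """.stripMargin)
  }
}
```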


Data Processing Flow

Import the data into the corresponding MySQL business database

Create "ycak" and "ycbk" databases in MySQL, respectively run "ycak.sql" and "ycbk.sql" under the corresponding databases, and import the data into the business database.

Use Sqoop to extract data to the Hive ODS layer

Import the "machine_baseinfo" table under the "ycak" library into the TO_YCAK_MAC_D table of the ODS layer through Sqoop every day, and import the script "ods_mysqltohive_to_ycak_mac_d.sh"

Configure task flow with Azkaban

Azkaban is used here to configure the task flow for scheduling. To submit tasks to the cluster, modify the configuration item local.run="false" in the project's application.conf file, then package the project.
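One common way such a switch is consumed in a Spark project is through Typesafe Config, which loads application.conf from the classpath; the sketch below is an assumption about how local.run might be read, not the project's actual code (only the local.run key comes from the document):

```scala
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

object SessionFactory {
  // Load application.conf from the classpath
  private val config = ConfigFactory.load()

  // Build the SparkSession for local debugging or for the cluster, depending on local.run
  def getSparkSession(appName: String): SparkSession = {
    val builder = SparkSession.builder().appName(appName).enableHiveSupport()
    if (config.getString("local.run") == "true") builder.master("local[*]").getOrCreate()
    else builder.getOrCreate()
  }
}
```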

Data Visualization with Superset


The third business: daily active user statistics

Requirements


Detailed information about users who were active in the last 7 days is computed every day, as shown in the example above.

To calculate daily user activity, we need the users' daily login information. User logins are recorded in the "user_login_info" table in the "ycak" business database; this table records, for each day, when users log into and out of the system. We can incrementally extract this table to the ODS layer every day, extract the users' basic information to the ODS layer in full every day, derive the daily active users from them, and then compute the 7-day activity statistics. The users' basic information covers four types of registered users, stored in the "ycak" business database in the tables "user_wechat_baseinfo", "user_alipay_baseinfo", "user_qq_baseinfo", and "user_app_baseinfo".
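A minimal sketch of the 7-day active-user count on top of the login data (the ODS table name, the uid column, and the yyyyMMdd dt partition format are assumptions):

```scala
import org.apache.spark.sql.SparkSession

object ActiveUserStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ActiveUserStats")
      .enableHiveSupport()
      .getOrCreate()

    // Distinct users with at least one login in the last 7 daily partitions;
    // music.TO_YCAK_USR_LOGIN_D stands in for the ODS table loaded from user_login_info
    spark.sql(
      """
        |SELECT COUNT(DISTINCT uid) AS active_user_7d
        |FROM music.TO_YCAK_USR_LOGIN_D
        |WHERE dt >= date_format(date_sub(current_date(), 7), 'yyyyMMdd')
      """.stripMargin).show()
  }
}
```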

Model Design

The data tables required by this business are extracted to the ODS layer through Sqoop. According to the business, we build the "user" theme in the data warehouse. The data layering is as follows:


Origin blog.csdn.net/lxianshengde/article/details/124821032