Kugou Music's Big Data Practice

This article is based on a talk given by Kugou Music big data architect Wang Jin in the [QCon High Availability Architecture Group]. Please credit the source when reposting.

Wang Jin: Big data architect at Kugou Music, responsible for Kugou's big data technology planning, construction, and application. 11 years of IT industry experience, including 2 years of distributed application development and 3 years of hands-on big data work; main areas of interest are stream computing, big data storage and computing, distributed storage systems, NoSQL, and search engines.

Edited and compiled by: Chen Gang @ Beijing Zhizhi


This talk covers four topics: what big data is, big data technology architecture, big data technology implementation, and continuous improvement.

A big data platform is a large, systematic undertaking. The construction cycle is long, the chain of components involved is long (data collection, ingestion, cleaning, storage and computing, data mining, visualization, and more, each of which can be built as a system in its own right), and the risks are correspondingly high.

1. What is Big Data

  1. The term "big data" refers to the phenomenon in which the user behavior data generated and accumulated by a company's daily operations grows so fast that it becomes hard to manage with existing database tools; the difficulty spans data acquisition, storage, search, sharing, analysis, and visualization. The volume is so large that it is no longer measured in the familiar units of GB and TB, but in PB (about 1,000 TB), EB (about 1 million TB), or ZB (about 1 billion TB); hence the name big data.

    (Illustration 1)

  2. Where big data comes from: Over the past half century, as computer technology has become fully woven into social life, information has accumulated to the point where it is beginning to trigger change. Not only is the world awash in more information than ever before, the rate of growth is also accelerating. The information explosion in fields such as astronomy and genetics gave rise to the concept of "big data", which today applies to nearly every area of human endeavor. The 21st century is an era of data and information: the mobile Internet, social networks, e-commerce, and the like have greatly expanded the reach of the Internet, and data of every kind is expanding rapidly. The Internet (social, search, e-commerce), the mobile Internet (microblogs), the Internet of Things (sensors, smart earth), connected vehicles, GPS, medical imaging, security monitoring, finance (banking, stock markets, insurance), and telecommunications (calls, text messages) are all generating data at a furious pace.

    (Illustration 2)

  3. Characteristics of big data: volume, variety, velocity, and low value density are the hallmark "4V" characteristics of big data; put another way, only data exhibiting these characteristics counts as big data.

    (Illustration 3)

     

  4. The problem big data technology solves: big data technology exists to, at an affordable cost, rapidly (velocity) collect, discover, and analyze huge volumes (volume) of heterogeneous (variety) data and extract value from it; it represents a new generation of technology and architecture in the IT field.

    (Illustration 4)

  5. Application prospects of big data: From the application perspective, by storing, mining, and analyzing big data, organizations can accomplish a great deal in marketing, enterprise management, data standardization, and intelligence analysis. From an industry perspective, big data can improve customer service and marketing, help enterprises cut costs and raise operational efficiency, and also help them innovate their business models and discover new market opportunities. From the perspective of its value to society as a whole, big data holds huge potential in smart cities, intelligent transportation, and disaster early warning. Analysts predict that as Internet technology advances and cloud computing and Internet of Things applications grow richer, the prospects for big data will only broaden.

(Illustration 5)

Big data is now everywhere in daily life. As engineers, how do we at Kugou tackle the big data problem with technology?

2. Big data technology architecture of Kugou data center

The first generation of big data architecture:


(Illustration 6)

The first generation was based mainly on Hadoop 1.x + Hive for offline (T+1) computing. Its drawbacks: computation can only start after all the data has arrived, and cluster resource utilization is very uneven. From 1:00 am to 12:00 noon the cluster is at its busiest; from noon until 1:00 am the next day it sits largely idle.

In big data, the fresher the data, the more valuable it is (for example, real-time personalized recommendation systems, RTB systems, real-time early-warning systems), which is why we developed the second-generation architecture. At present the data center runs two clusters in parallel (Hadoop 1.x and Hadoop 2.6): new business connects directly to the new cluster, business data on the old cluster is being migrated over, and the results from the new cluster are cross-checked against those from the old one (everything will be switched to the new cluster very soon).

Second-generation big data technology architecture

From the perspective of the data processing flow, it is divided into data sources, data ingestion, data cleaning, storage and computing, data services, and data consumption. The overall big data processing flow is as follows:

(Illustration 7)

The overall architecture diagram of the second-generation big data technology is as follows:

(Illustration 8)

Big data computing is divided into real-time computing and offline computing. Across the cluster, we rely on real-time stream processing to spread the work out over time and relieve the concentration of cluster resource usage.

Offline computing (batch processing): implemented with Spark and Spark SQL [I will not introduce Spark in detail here; it is the fastest-growing big data processing framework of recent years]. Overall performance is 5-10x that of Hive, and all of our Hive scripts are being converted to Spark SQL, as sketched below.
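As a concrete illustration of that conversion, here is a minimal sketch (not our actual job code) of running a Hive-style query through Spark SQL's HiveContext, in the Spark 1.x style; the table and column names (dws_play_log, dt, user_id) are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object DailyActiveUsers {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("daily-active-users"))
    val hiveContext = new HiveContext(sc)

    // The same HiveQL that used to run on Hive/MapReduce now runs on the Spark engine.
    val dau = hiveContext.sql(
      """SELECT dt, COUNT(DISTINCT user_id) AS dau
        |FROM dws_play_log
        |WHERE dt = '2015-08-01'
        |GROUP BY dt""".stripMargin)

    dau.collect().foreach(println)
    sc.stop()
  }
}
```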

Real-time computing: based on Storm, Drools, and Esper. [I will not go into Storm's details here; for an introduction, follow this official account and browse its past posts. For Drools, see: http://www.drools.org/; for Esper, see: http://blog.csdn.net/luonanqin/article/category/1557469]

HBase/MySQL: store and serve the results of real-time and offline computation.

Redis: stores intermediate computation results, dictionary data, and the like.

ElasticSearch: Used for real-time query of detailed data and secondary index storage of HBase.

With the new architecture, data timeliness has improved markedly: metrics such as DAU and PV are now available in real time, and overall offline computation time has been cut by 50%.

3. Kugou big data technology implementation

The following sections explain the implementation details of data collection and ingestion, data cleaning, the real-time monitoring system, and detail queries.

Component: data collection

As the data processing flow diagram shows, the data sources fall into front-end logs, server-side logs, and business data. The following explains how each is collected and ingested.

a. Front-end log collection and ingestion:

Front-end log collection needs to be real-time, reliable, and highly available. During technology selection we tested and compared the open-source collection tools Flume, Scribe, and Chukwa, and found that none of them really met the needs of our business scenarios.

(Illustration 9)

For a detailed comparison, see http://www.ttlsa.com/log-system/scribe-chukwa-kafka-flume-log-system-contrast/. We therefore chose to build our own front-end proxy gateway on top of Kafka to meet the collection requirements. We took a few detours while developing the gateway; in the end it was built with nginx + Lua, with the Kafka producer protocol implemented in Lua. If you are interested in that producer protocol implementation, take a look at https://github.com/doujiang24/lua-resty-kafka, written by a colleague; it is still active on GitHub and draws plenty of questions.

The specific implementation of the collection gateway is as follows:

[Data reliability]

The collection gateway provides the following two guarantees for data reliability:

  1. The SDK sends data to the collection gateway; the gateway returns success once it has received the data and from then on is responsible for delivering it to Kafka. 1) If the SDK does not receive a success response it retries (for safety, the retry carries the previous task id); 2) the JS and Flash clients have no retry logic.

  2. For its communication with Kafka, the gateway provides two different levels of guarantee: 1) at most once (the data is delivered successfully at most once); 2) at least once (the data is delivered successfully at least once).

[Details of the reliability guarantees between the gateway and Kafka]

Premise: Kafka's overall design and wire protocol do not provide an exactly-once guarantee [http://kafka.apache.org/documentation.html#semantics]. Therefore, in the communication between the gateway and Kafka, every send ends in one of three states: 1) successfully written to Kafka (two replicas); 2) safe to retry (Kafka definitely did not receive the data); 3) unsafe to retry (Kafka may or may not have written the data).

To keep the data accurate, the gateway adopts the following strategies: 1) if a send can be safely retried, the gateway caches the data locally (Redis first, then disk) and resends it to Kafka later; 2) if a retry is not safe, the data goes into a designated fault-tolerant topic.

These strategies ensure that the normal topics provide an [at most once] guarantee, while adding the fault-tolerant topics provides an [at least once] guarantee.

Unsafe-retry states include: 1) network errors (send timeout, response timeout); 2) the RequestTimedOut error code (with this error code, data is not successfully written while a partition is being migrated dynamically; in other cases it will have been written successfully).

With this two-layer guarantee, combined with the metrics the gateway reports (amount received, amount successfully sent, amount cached locally), we can: 1) monitor the health of the whole system from the volume flowing through each topic; 2) for ordinary business scenarios, use only the normal topic; 3) if the network jitters for a long period or Kafka goes down (which produces more unsafe-retry data), apply special processing to the data in the fault-tolerant topic.
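To make the routing concrete, here is a minimal sketch in Scala (the real gateway is nginx + Lua, so this is purely illustrative) of how each send outcome maps onto the three states above; the topic name and the cache/producer stand-ins are hypothetical.

```scala
object GatewayRouting {
  sealed trait SendOutcome
  case object Written     extends SendOutcome // Kafka acknowledged the write (two replicas)
  case object SafeRetry   extends SendOutcome // Kafka definitely did not receive the record
  case object UnsafeRetry extends SendOutcome // Kafka may or may not have written the record

  // Stand-ins: the real gateway buffers in Redis and then on disk, and sends via a Kafka producer.
  val localCache = scala.collection.mutable.Queue[Array[Byte]]()
  def sendToTopic(topic: String, record: Array[Byte]): Unit =
    println(s"send ${record.length} bytes to $topic")

  def route(record: Array[Byte], outcome: SendOutcome): Unit = outcome match {
    case Written     => ()                         // already in the normal topic: at-most-once holds
    case SafeRetry   => localCache.enqueue(record) // buffer locally, resend to the normal topic later
    case UnsafeRetry => sendToTopic("events_fault_tolerant", record) // at-least-once via fault-tolerant topic
  }
}
```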

b. Server-side log collection and ingestion:

FileCollect: since the environment variables of many production machines cannot be changed, and to minimize intrusiveness, file collection is currently implemented in Go.

The overall front-end and server-side data architecture is as follows:

(Illustration 10)

c. Business data ingestion

Canal: we use Canal to synchronize incremental business data in real time via MySQL's binlog mechanism. (For an introduction to Canal, see: http://agapple.iteye.com/blog/1796633)

Component: data cleaning (ETL)

The preceding sections covered how data is collected and ingested. The components that follow are built on the Storm framework, starting with data cleaning (ETL). The ETL here performs only simple data unescaping, field completion, and handling of abnormal data.

(Illustration 11)

In the Storm data-cleaning topology, the Kafka Spout is responsible for consuming data from Kafka

IsDecode Bolt is responsible for determining whether the data is decoded

Decode Bolt is responsible for data decoding

Rules Bolt is responsible for data rule parsing, and introduces a rule engine (Drools) to solve data change requirements.

FormatRule is responsible for data formatting rules and adapts to different formats.

DataAdapter is responsible for data storage adaptation, and needs to be adapted to HDFS, HBase, Spark, database, etc.

Error Bolt is responsible for writing abnormal data to HDFS, which is convenient for detailed query of abnormal data.

Stat Bolt counts the amount of data consumed from Kafka (a rough wiring sketch of this topology follows)
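For readers who have not written a Storm topology before, the following is a minimal sketch of how such a bolt chain could be wired together with Storm's topology API (Storm 0.9.x packages). The class names, the pass-through bolt, and the Kafka/ZooKeeper addresses are illustrative assumptions, not our production topology.

```scala
import backtype.storm.{Config, LocalCluster}
import backtype.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import backtype.storm.topology.base.BaseBasicBolt
import backtype.storm.tuple.{Fields, Tuple, Values}
import storm.kafka.{KafkaSpout, SpoutConfig, ZkHosts}

// Stand-in for the Decode / Rules / DataAdapter bolts described above: each real bolt
// would decode the record, apply Drools rules, or write to HDFS/HBase instead of passing through.
class PassThroughBolt(outField: String) extends BaseBasicBolt {
  override def execute(input: Tuple, collector: BasicOutputCollector): Unit =
    collector.emit(new Values(input.getValue(0)))
  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields(outField))
}

object EtlTopology {
  def main(args: Array[String]): Unit = {
    val spoutConf = new SpoutConfig(new ZkHosts("zk1:2181"), "front_log", "/kafka-spout", "etl")
    val builder = new TopologyBuilder
    builder.setSpout("kafka-spout", new KafkaSpout(spoutConf))
    builder.setBolt("decode", new PassThroughBolt("decoded")).shuffleGrouping("kafka-spout")
    builder.setBolt("rules", new PassThroughBolt("parsed")).shuffleGrouping("decode")
    builder.setBolt("adapter", new PassThroughBolt("stored")).shuffleGrouping("rules")

    new LocalCluster().submitTopology("etl", new Config, builder.createTopology())
  }
}
```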

When using Storm we ran into the question of how to apply business configuration changes dynamically. We solve this in an event-driven way: at first we considered creating an event queue in Kafka and listening for event data on that queue; later we found storm-signals on the ever-resourceful GitHub and used it instead (see https://github.com/ptgoetz/storm-signals for details). For the Drools-based rule engine, we also had to make a modified rule file take effect without restarting the topology. This again builds on the storm-signals component: the drl file is converted into a stream and stored in Redis, and the topology fetches the drl stream dynamically.

Component: real-time monitoring

Next I will describe the implementation of the Storm-based real-time monitoring system. Before that, a quick introduction to OpenTSDB, which handles storage for real-time monitoring.

a. OpenTSDB

OpenTSDB is an open-source database for time series data built on HBase. Strictly speaking it is just an application of HBase, but the way it handles time series data is well worth borrowing in other systems. OpenTSDB uses HBase as its storage core. It requires no sampling: it can collect and store hundreds of millions of data points in full and supports second-granularity monitoring. Thanks to HBase's distributed columnar storage, it can flexibly accommodate new metrics and can support collection from tens of thousands of machines and hundreds of millions of data points. In OpenTSDB, the TSD is the daemon that talks to HBase on behalf of clients; there is no master/slave distinction and no shared state, so combined with the characteristics of an HBase cluster this eliminates single points of failure. Users can hit the TSD interface directly over telnet or HTTP, or access TSD via RPC. Every server whose metrics are to be collected needs a Collector to gather the time series data; the Collector is simply your data collection script.

(Illustration 12)
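As a quick illustration of the telnet-style interface mentioned above, the sketch below pushes a single data point to a TSD. The host, metric name, and tags are made-up examples, and a real collector would batch writes and handle errors.

```scala
import java.io.PrintWriter
import java.net.Socket

object TsdbPut {
  def main(args: Array[String]): Unit = {
    val socket = new Socket("tsd-host", 4242)               // 4242 is the default TSD port
    val out = new PrintWriter(socket.getOutputStream, true)
    val now = System.currentTimeMillis() / 1000
    // Format: put <metric> <unix-timestamp> <value> <tag=value> ...
    out.println(s"put kafka.messages.in $now 12345 topic=front_log host=broker01")
    socket.close()
  }
}
```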

Suppose you want to quickly chart, over a period of time, the number of DELETE statements executed by MySQL, the number of slow queries, the number of temporary files created, the 99th-percentile latency, and so on. OpenTSDB can easily store and process over a million data points and dynamically generate the corresponding graphs in real time, as shown below:

(Illustration 13)

OpenTSDB uses asynchbase, a fully asynchronous, non-blocking, thread-safe HBase client that uses fewer threads and locks and less memory to deliver higher throughput, especially for write-heavy workloads. The following figure shows the read and write flow:

(Illustration 14)

In HBase, table schema design has a large impact on performance; the tsdb-uid table and the tsdb table are shown in Table 1 and Table 2.

b. Real-time monitoring system

Overall architecture diagram:

(Illustration 17)

Kafka Spout is responsible for reading Kafka data

Decode Bolt is responsible for log decoding

Detail Bolt is responsible for writing the raw data to the ES cluster, enabling real-time queries over raw logs.

Stat Rules Bolt is responsible for log format parsing, and introduces a rule engine (Drools) to solve the data format change requirements.

TSD Bolt is responsible for the storage of multi-dimensional statistical results, and establishes the mapping relationship between statistical indicators and Rowkey through the TSD service

Alarm Bolt is responsible for writing the multi-dimensional statistical results to the Kafka alarm queue (a small sketch of such a write follows).
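As a small illustration of that last step, here is a minimal sketch of writing an alarm record to a Kafka topic with the standard Kafka producer client. The broker address, topic name, and JSON payload are assumptions for illustration, not our actual alarm schema.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object AlarmQueueWriter {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // One statistical result that crossed its threshold, encoded as a simple JSON string.
    producer.send(new ProducerRecord("alarm_queue", "pv.qps",
      """{"metric":"pv.qps","host":"web01","value":120,"threshold":100}"""))
    producer.close()
  }
}
```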

The bottleneck in the real-time monitoring system turned out to be not the real-time computation but the storage of results. We spent a lot of time tuning and testing the storage layer, drawing on some of Ctrip's suggestions for OpenTSDB (Ctrip uses OpenTSDB very effectively and appears to have applied for a patent on its approach). We made the following improvements and optimizations to OpenTSDB, customizing its source code to our needs:

(1) Remove the grouping interpolation during aggregation and directly aggregate

(2) Modify the timestamps in startkey and endkey, query only the required rows, and add a time zone offset to the query result timestamp.

(3) A downsampling table is added for queries with larger sampling granularity, which reduces the number of rowkeys and improves query performance.

(4) Add coprocessor support to reduce io and serialization/deserialization overhead.

The fourth point is a major change: OpenTSDB's query processing is moved from the client to the server side (a coprocessor), which greatly reduces I/O and serialization/deserialization overhead and improves performance significantly. We also gain performance by reducing dimensionality and adding partition identifiers, mainly exploiting HBase's partitioning characteristics.

Component: detail query

Next, the detail-data query solution: we introduced a search engine, with the following architecture:

(Illustration 18)

At present the real-time monitoring system's detailed log query is in place. Storm converts the detail data into the appropriate format, logs are written to ES in real time via the ElasticSearch-Storm component, and ES then serves the real-time queries (a minimal query sketch follows). We are also using ES to solve HBase's secondary-index problem.
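As an illustration of the query side, the sketch below runs a simple URI search against ES; the index name, field names, ES host, and query term are hypothetical.

```scala
import scala.io.Source

object DetailLogQuery {
  def main(args: Array[String]): Unit = {
    // URI search: return the 20 most recent ERROR-level detail logs from a daily index.
    val url = "http://es-host:9200/detail_logs-2015.08.01/_search" +
      "?q=level:ERROR&size=20&sort=timestamp:desc"
    val response = Source.fromURL(url).mkString   // JSON body containing the matching raw log lines
    println(response)
  }
}
```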

Regarding Storm, we have extracted some common components to improve development efficiency.

(Illustration 19)

4. Continuous Improvement

The current problems in our data center:

1. A large volume of business code to develop

2. Querying massive amounts of data

3. Interpretation of data

In response to the above problems, improvements are made in the following aspects:

1. Implement Lambda architecture based on SummingBird

2. Big data storage and query optimization

3. Data visualization applications. If you are interested in SummingBird, you can refer to: http://clockfly.diandian.com/post/2014-05-19/summingbird introduction application

Q&A

Q1. I would like to ask Mr. Wang: in what scenarios is Spark mainly used at present? Does its all-in-one model of batch processing, stream processing, and data-warehouse SQL have advantages by comparison? And how does Spark share cluster resources with other platforms? Thanks!

Most of our new business is now implemented with Spark and Spark SQL, for example user profiling. Batch processing and stream processing mainly target massive data; SQL support in big data systems is still not fully mature. Spark runs mainly on YARN, which is responsible for resource management, and we manage resources through fair-scheduling policies.

Q2. When visualizing data, how do you control data permissions for different roles?

We have not yet implemented data permissions via the feature points corresponding to each role; this will be implemented through ACLs in the future. ACL support is provided in Hadoop 2.6, and our cluster is already on that latest version.

Q3. A question: we find Spark SQL very unstable and prone to crashes, while Hive remains very stable. How do you deal with this?

We are on Spark 1.3. We also run into problems with Spark SQL; they can usually be solved by adjusting memory parameters, and otherwise by falling back to the native Spark API.
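For reference, these are the kinds of Spark 1.x memory-related settings one might adjust in such cases; the values and the application name below are illustrative examples, not our production configuration.

```scala
import org.apache.spark.SparkConf

object TunedJobConf {
  // Build a SparkConf before creating the SparkContext so the settings take effect.
  val conf = new SparkConf()
    .setAppName("report-job")
    .set("spark.executor.memory", "4g")          // heap per executor
    .set("spark.shuffle.memoryFraction", "0.4")  // Spark 1.x: share of heap for shuffle buffers
    .set("spark.storage.memoryFraction", "0.4")  // Spark 1.x: share of heap for cached data
}
```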

Q4. How do the log-collection SDK and the gateway communicate: over a long-lived connection, or HTTP? And how do you accommodate multiple different clients?

Over HTTP, with our own message protocol defined on top of it; it is not a long-lived connection. The various mobile and PC clients each carry their own identifiers.

Q5. Both Storm and Spark offer real-time computing. Which scenarios is each suited to?

Where the requirements on stability and real-time performance are high, we use Storm. The current version of Spark Streaming has relatively high latency, its Kafka support is not yet complete, and there are plenty of pitfalls.

Q6. How many people maintain this architecture? It seems you use a lot of languages and frameworks. After you modify a framework, how do you handle those modifications when a new version of the framework is released?

More than a dozen people maintain it; the languages are mainly Java and Scala. Our modifications are fairly extensive, so we will keep adding new features going forward, and when the time is right we may also consider open-sourcing the work.

Q7. Could you tell us more about the results so far and the overall plan for building user profiles on big data? I'm interested.

The user profiling work is actually still on its first version. Model training and classification are separate: Spark applies the trained model to behavior data to produce user labels. In the second phase the labeling system will be improved based on Spark Streaming + Spark MLlib + HBase; that plan has already been settled.

Q8. Compared with the Hadoop ecosystem, Spark's batch processing, stream processing, and data-warehouse SQL form an all-in-one model. Does that bring real advantages in practice? And how does Spark share cluster resources with other platforms?

Compared with MR and Hive, jobs execute more efficiently, so more work gets done in the same amount of time, but Hive is currently more stable than Spark SQL. Spark is deployed on YARN, and YARN is responsible for cluster resource scheduling.

Follow-up: So Spark's batch processing is still more efficient than MR, but in the all-in-one model each workload still processes its data separately and stores it externally; there is no direct in-memory data exchange within Spark, right?

Yes, the results ultimately land in HDFS. We have introduced Tachyon, which speeds up data reads.

Follow-up: I am working on a small project whose goal is to share in-memory cached data across multiple Spark applications, for example letting streaming build up history that SQL can use directly without persisting it first. Not sure whether that fits your scenario?

There are many application scenarios for Kafka + Spark Streaming; we will use that combination for user profiling later on.

Thanks to Chen Gang for recording and compiling this article, and to the many other volunteers on the editorial team who contributed. For more on architecture, follow the "High Availability Architecture" official account.
