Storage design for a network-wide public opinion analysis system with tens of billions of pages

Foreword

In today's flood of online information, news spreads faster than we imagine. A post from a verified influencer (a "big V") on Weibo, a status update in WeChat Moments, a story on a popular forum, or a review on a shopping platform can generate tens of thousands of reposts, follows, and likes. Irrational negative comments can stir up negative sentiment and even damage consumers' perception of a company's brand. If the right measures are not taken in time, the losses can be hard to estimate. We therefore need an efficient network-wide public opinion analysis system that lets us observe public opinion in real time.

This network-wide public opinion analysis system must store tens of billions of web pages, crawl and store new pages in real time, and extract metadata from new pages in real time. The extraction results then feed further mining and analysis, including but not limited to:

Public opinion impact diagnosis: predict, from the scale of the spread and the diffusion trend, whether a topic will eventually develop into a public opinion event.
Propagation path analysis: identify the key paths along which public opinion spreads.
User profiling: outline the common characteristics of participants, such as gender, age, region, and topics of interest.
Sentiment analysis: classify news or reviews as positive or negative, then aggregate statistics by sentiment class.
Early warning: a threshold can be set on the discussion volume of a topic; when it is reached, the business party is notified by push so that the best window for responding is not missed.
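The early-warning item can be illustrated with a minimal sketch. It assumes that each mention of a topic arrives with a non-decreasing timestamp and that the alert should fire once when the count inside a sliding window reaches the threshold. The class name `VolumeAlert` and its parameters are hypothetical and are not part of the system described here.

```python
from collections import deque


class VolumeAlert:
    """Fire once when mentions within a sliding time window reach a threshold.

    Hypothetical helper for illustration; timestamps are assumed to arrive
    in non-decreasing order.
    """

    def __init__(self, threshold: int, window_seconds: float):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()  # mention times inside the current window
        self._fired = False         # True while the volume stays at or above the threshold

    def record(self, timestamp: float) -> bool:
        """Record one mention; return True only when the alert newly fires."""
        self._timestamps.append(timestamp)
        # Drop mentions that have fallen out of the sliding window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        over = len(self._timestamps) >= self.threshold
        newly_fired = over and not self._fired
        self._fired = over
        return newly_fired
```

Returning True only on the transition keeps subscribers from being notified again for every further mention while the topic stays hot.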
The mined results are pushed to the parties that requested them, and an interface lets business parties search and query them. Next we discuss the problems that may arise in the system design, focusing on the storage-related ones, and look for the best solution to each.

System design

A public opinion system first needs a crawler engine that collects original page content from mainstream portals, shopping sites, community forums, Weibo, and WeChat Moments. The collected pages and messages (tens of billions) must be stored in real time. Before fetching a page by its URL, the crawler should check whether the page has already been crawled, to avoid unnecessary repeat crawls. After a page is collected, it is processed: unneeded tags are removed and the title, abstract, body, comments, and so on are extracted. The extracted content goes into the storage system for later queries. At the same time, newly extracted results are pushed to the computing platform for statistical analysis, report generation, and later public opinion retrieval. Depending on the algorithm, a computation may need only the new data or the full data set. Because public opinion is time-sensitive, the system must process new content efficiently; ideally, newly trending content becomes searchable within seconds.

We can summarize the entire data flow as follows:

[Figure: data flow of the public opinion analysis system]

From the figure above, designing a storage and analysis platform for network-wide public opinion means handling crawling, storage, analysis, search, and display. Specifically, we need to solve the following problems:

How to efficiently store tens of billions of original web pages. To make public opinion analysis more comprehensive and accurate, we want to crawl as many pages as possible and then aggregate them according to the weights we set. The historical page library therefore grows large: tens of billions of pages, amounting to hundreds of terabytes or even several petabytes. At that volume we still need millisecond-level read and write latency, which traditional databases struggle to deliver.
How to tell, before the crawler fetches a page, whether it has been crawled before. For ordinary pages, public opinion analysis cares about timeliness, so we may want to crawl the same page only once. We can then deduplicate by page address and avoid wasting resources on repeat crawls. This requires distributed storage that provides efficient random lookups by page address.
How to perform real-time structured extraction once a new original page is stored, and store the extraction results. The original page contains various HTML tags; we need to strip them and extract the article's title, author, publication time, and so on. This provides the structured data that later sentiment analysis depends on.
How to connect efficiently to the computing platform and stream newly extracted structured data into it for real-time computation. Here we classify pages and messages by their content and descriptions, recognize sentiment, and compute statistics over the results. Full-scale analysis is too slow, and public opinion usually centers on the latest news and comments, so the analysis must be incremental.
How to provide efficient public opinion search. Besides subscribing to fixed keywords, users also run ad hoc keyword searches, for example to see how the public is reacting to a competitor's new product.
How to push new public opinion in real time. To keep results timely, we must not only persist the analysis results but also push them. The pushed content is usually the new public opinion found by real-time analysis.
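The structured extraction problem in the list above can be sketched with Python's standard `html.parser` module. This is an illustration only: the class `PageExtractor` and the function `extract` are hypothetical names, and a production extractor would also need rules for author, publication time, and comments that depend on each site's markup.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect the <title> text and the visible body text of a page."""

    SKIPPED = {"script", "style"}  # content of these tags is never visible text

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        elif self._skip_depth == 0:
            self.text_parts.append(text)


def extract(html: str) -> dict:
    """Strip tags from raw HTML and return a small structured record."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return {"title": parser.title, "body": " ".join(parser.text_parts)}
```

The returned dictionary stands in for the structured metadata row that would be written back to storage.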
System architecture

To address the problems above, this section shows how to build a network-wide public opinion analysis platform holding tens of billions of pages on top of Alibaba Cloud products. We focus on choosing the storage product and on connecting it efficiently to the computing and search platforms.

[Figure: architecture of the public opinion analysis platform on Alibaba Cloud]

We use ECS instances as the crawler engine and size the number of instances according to the crawl volume. Capacity can also be expanded temporarily during each day's crawling peak. After a page is crawled, its address and content are written to the storage system. To avoid repeat crawls, the crawler engine deduplicates against the stored URLs before fetching. The storage engine must therefore support low-latency random lookups to determine whether a URL already exists; if it does, the page is not crawled again.
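The check-before-crawl step can be sketched as follows. `PageStore` is a hypothetical in-memory stand-in for the distributed table, not a real SDK interface, and hashing the URL into a fixed-length row key is an assumption of this sketch rather than something the architecture prescribes.

```python
import hashlib


def row_key(url: str) -> str:
    """Derive a fixed-length key from the URL (an assumed key design)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


class PageStore:
    """In-memory stand-in for the distributed table (hypothetical interface)."""

    def __init__(self):
        self._rows = {}

    def exists(self, key: str) -> bool:
        # In the real system this is the low-latency random lookup.
        return key in self._rows

    def put(self, key: str, row: dict) -> None:
        self._rows[key] = row


def crawl_once(store: PageStore, url: str, fetch) -> bool:
    """Fetch and store the page only if its URL has not been seen before.

    `fetch` is any callable that takes a URL and returns the page content.
    Returns True if the page was crawled, False if it was skipped.
    """
    key = row_key(url)
    if store.exists(key):
        return False
    store.put(key, {"url": url, "content": fetch(url)})
    return True
```

With several crawler instances running in parallel, the existence check and the write would need to be a single conditional write to avoid races; the sketch leaves that out.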

To extract content from original pages in real time, newly added pages must be pushed to the computing platform. Earlier architectures usually did this with double writes in the application layer: the original page data is written to the database, and a second copy is written to the computing platform. That forces us to maintain two sets of write logic. The same problem arises when structured increments enter the public opinion analysis platform: the extracted structured metadata must also be double-written to it. The analysis results, in turn, must be written to distributed storage and pushed to the search platform. The three red lines in the figure mark these double-write requirements on three data sources. They add development work and make the system more complex to implement and maintain. Each data source must either be aware of its downstream consumers when double-writing, or be decoupled from them by also writing each record to a message service. Traditional databases such as MySQL support subscription to their incremental log, the binlog. If a distributed storage product can offer high throughput and large capacity and also provide incremental subscription, our architecture becomes much simpler.
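The difference between double writing and incremental subscription can be shown with a small in-memory model. `ChangeLogTable` and `Subscriber` are hypothetical names used only for this illustration; they model the idea of a binlog-style change feed and do not reproduce any real storage API.

```python
class ChangeLogTable:
    """In-memory stand-in for a table that keeps an ordered log of its writes."""

    def __init__(self):
        self._rows = {}
        self._log = []  # append-only list of (key, row) in write order

    def put(self, key, row):
        # The writer performs a single write; it knows nothing about consumers.
        self._rows[key] = row
        self._log.append((key, row))

    def read_changes(self, checkpoint: int, limit: int = 100):
        """Return the changes after `checkpoint` and the next checkpoint."""
        batch = self._log[checkpoint:checkpoint + limit]
        return batch, checkpoint + len(batch)


class Subscriber:
    """Downstream consumer that tracks its own position in the change log."""

    def __init__(self, table: ChangeLogTable, handler):
        self._table = table
        self._handler = handler
        self._checkpoint = 0

    def poll(self) -> int:
        """Process all changes since the last poll; return how many were handled."""
        batch, self._checkpoint = self._table.read_changes(self._checkpoint)
        for key, row in batch:
            self._handler(key, row)
        return len(batch)
```

Because each subscriber keeps its own checkpoint, adding another downstream consumer (extraction, analysis, search indexing) requires no change to the code that writes pages.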

Once page data is collected and stored, it flows incrementally into the computing platform for real-time metadata extraction. Function Compute is a good fit here: when new pages need extraction, a managed function is triggered to extract the page metadata. The extracted results are persisted in the storage system and, at the same time, pushed to MaxCompute for public opinion analysis such as sentiment analysis and text clustering. The output may include public opinion report data, user sentiment statistics, and other results. These results are written to the storage system and the search engine, and some reports and threshold alerts are pushed to subscribers. The search engine's data serves the online public opinion retrieval system.
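The incremental analysis step can be sketched as running counters that are updated batch by batch instead of being recomputed over the full data set. Everything here is an assumption for illustration: the keyword lexicon is a toy placeholder for a real sentiment model, and the record fields `topic` and `body` are hypothetical.

```python
from collections import Counter

# Toy lexicons; a real system would use a trained sentiment model.
POSITIVE_WORDS = {"good", "great", "love"}
NEGATIVE_WORDS = {"bad", "broken", "refund"}


def classify(text: str) -> str:
    """Label text by counting distinct positive and negative keywords."""
    words = set(text.lower().split())
    score = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


class SentimentStats:
    """Running (topic, sentiment) counts, updated only with new records."""

    def __init__(self):
        self.counts = Counter()

    def update(self, records) -> None:
        """Fold one batch of newly extracted records into the counters."""
        for record in records:
            label = classify(record["body"])
            self.counts[(record["topic"], label)] += 1
```

Each call to `update` touches only the newly arrived batch, which is what keeps the statistics fresh without rescanning history.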

With the complete architecture introduced, let's look at how to choose storage on Alibaba Cloud.

Storage selection

From the architecture above, we can summarize the requirements for storage selection:

Support for massive data storage (TB to PB scale), highly concurrent access (100,000 to 10 million TPS), and low access latency.
Elastic scaling. The collection volume changes as the crawled and subscribed page sources are adjusted, and within a single day the crawl volume has clear peaks and troughs, so the database must be able to scale out and in flexibly.
A free table structure (flexible schema). The attributes we care about may differ considerably between ordinary web pages and social platform pages, so a flexible schema makes extension easier.
Automatic expiration or tiered storage for old data. Public opinion data usually centers on recent hot topics, so old data is accessed less often.
A good incremental channel, so that newly added data can be regularly exported to the computing platform. The three red dashed lines in the figure above share one feature: increments flow to the corresponding computing platform in real time, and the computed results are written back to the corresponding storage engine. If the database engine supports increments natively, the architecture becomes much simpler: there is no need to scan the full data set to filter out increments, or to double-write from the client to implement the incremental logic.
A good search solution: either built in, or able to feed data seamlessly into a search engine.
Given these requirements, we need a distributed NoSQL database to store and serve the massive data. The need for incremental data access at several stages, together with the peaks and troughs in business traffic, makes Table Store, with its flexible billing, the best choice for this architecture. For an introduction to the schema of Table Store, see Table Store Data Model.

Compared with similar databases, Table Store has one major functional advantage: a relatively complete incremental interface, the Stream incremental API. For an introduction to Stream, see Table Store Stream Overview; for scenarios, see Stream application scenario introduction; for API usage, see JAVA SDK Stream. With the Stream interface we can subscribe to all modification operations on a Table Store table, that is, to all newly written data. On top of Stream we have also built a number of data channels to downstream computing products, so users do not even need to call the Stream API directly: they can subscribe to incremental data downstream through these channels and plug into the whole Alibaba Cloud computing ecosystem. Table Store supports every product mentioned in the architecture above: Function Compute, MaxCompute, Elasticsearch, and DataV. For details, see:

Stream and Function Compute
Stream and MaxCompute
Stream and Elasticsearch
Using DataV to display data stored in Table Store

Table Store uses a free table structure for attribute columns. In public opinion analysis, we may add new attribute fields as the analysis algorithm is upgraded, and the attributes of ordinary web pages may differ from those of social pages such as Weibo. The free table structure therefore fits our needs better than a traditional database.
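The free table structure can be illustrated with plain dictionaries. This is a conceptual sketch, not the Table Store SDK: `put_row`, the row keys, and the attribute names are all hypothetical.

```python
def put_row(table: dict, primary_key: str, **attributes) -> None:
    """Write a row; attribute columns are not declared in advance."""
    table.setdefault(primary_key, {}).update(attributes)


pages = {}

# An ordinary news page and a social post carry different attribute columns.
put_row(pages, "news-1", title="Product launch", source="portal")
put_row(pages, "weibo-1", author="some_user", reposts=1200, likes=50)

# After an algorithm upgrade, a new attribute is added to an existing row
# without any schema change.
put_row(pages, "news-1", sentiment="negative")
```

Only the primary key is shared by all rows; each row holds whatever attribute columns were written to it.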

The architecture needs three stores: the original page library, the structured metadata library, and the public opinion result library. The first two are generally offline storage and analysis libraries; the last is an online database. They differ in access performance requirements and in storage cost. Table Store offers two instance types that support this tiering: high-performance and capacity. High-performance instances suit workloads with many writes and many reads, that is, online business storage. Capacity instances suit workloads with many writes and few reads, that is, offline business storage. On both, single-row write latency stays within 10 milliseconds, and on high-performance instances reads also stay at the millisecond level. Table Store also supports TTL, a data expiration time set at the table level. We can set a TTL on the public opinion results as needed, so that only recent data is served and older results expire and are deleted automatically.
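The effect of a table-level TTL can be modeled in a few lines. This sketch shows only what a reader of the table would observe; it makes no claim about how the storage service implements expiration, and the function `expire` and the field `written_at` are hypothetical.

```python
DAY = 24 * 60 * 60  # seconds in a day


def expire(rows: dict, ttl_seconds: int, now: float) -> dict:
    """Return only the rows whose age is below the table-level TTL."""
    return {
        key: row
        for key, row in rows.items()
        if now - row["written_at"] < ttl_seconds
    }


# Two result rows written on day 0 and day 29, read on day 30 with a 7-day TTL.
results = {
    "old-topic": {"written_at": 0 * DAY},
    "recent-topic": {"written_at": 29 * DAY},
}
visible = expire(results, ttl_seconds=7 * DAY, now=30 * DAY)
```

Here `visible` contains only `recent-topic`, matching the intent of serving recent public opinion and letting older results age out.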

With these features, Table Store satisfies all six storage requirements listed above, and the network-wide public opinion storage and analysis system can be designed and implemented on top of it.

Postscript

This article summarized the storage and analysis problems that arise in public opinion analysis over massive data, and showed how Table Store, developed in-house by Alibaba Cloud, both handles the business's data volume and simplifies the architecture by connecting to the computing platforms through the Stream interface. Table Store is a professional-grade distributed NoSQL database developed independently by Alibaba Cloud: a high-performance, low-cost, easily scalable, fully managed semi-structured data storage platform built on shared storage. Data processing of this kind is one of its important application areas. For other scenarios, see TableStore Advanced Road.

Author: Yu Heng
