Storage design for a network-wide public opinion analysis system holding tens of billions of web pages

Foreword

In the current flood of Internet information, news spreads far faster than we might imagine. A post from a verified influencer (a "big V") on Weibo, a status update in WeChat Moments, a news item on a popular forum, or a product review on a shopping platform can each generate tens of thousands of reposts, follows, and likes. Irrational negative comments can stir up negative sentiment and even damage consumers' perception of a company's brand. If the right measures are not taken in time, the losses can be hard to estimate. We therefore need an efficient network-wide public opinion analysis system that lets us observe public opinion in real time.

Such a system must store tens of billions of web pages, crawl and store new pages in real time, and extract metadata from new pages as they arrive. The extraction results then feed further mining and analysis, including but not limited to:

Influence diagnosis: predict, from the scale of spread and the diffusion trend, whether a topic will eventually develop into a public opinion event.
Propagation path analysis: identify the key paths along which public opinion spreads.
User profiling: outline the common characteristics of participants, such as gender, age, region, and topics of interest.
Sentiment analysis: classify news and comments as positive or negative, then aggregate statistics over the classified results.
Early warning: support a threshold on discussion volume and notify the business party when the threshold is reached, so that the best window for responding is not missed.

The mined results are pushed to the parties that need them, and an interface is provided for business parties to search and query. Next we discuss the problems likely to arise in designing such a system, focusing on storage-related topics and looking for a good solution to each.
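
The early-warning rule described above can be sketched in a few lines. This is only an illustration under assumptions: the `VolumeAlert` class, the `notify` callback, and the topic names are hypothetical, and a real system would keep the counters in durable storage rather than in memory.

```python
from collections import defaultdict


class VolumeAlert:
    """Fire a callback once per topic when its discussion count reaches a threshold."""

    def __init__(self, threshold, notify):
        self.threshold = threshold
        self.notify = notify            # hypothetical push hook supplied by the caller
        self.counts = defaultdict(int)  # topic -> number of mentions seen so far
        self.alerted = set()            # topics that have already triggered a push

    def record(self, topic, mentions=1):
        self.counts[topic] += mentions
        if self.counts[topic] >= self.threshold and topic not in self.alerted:
            self.alerted.add(topic)
            self.notify(topic, self.counts[topic])


sent = []
alert = VolumeAlert(threshold=3, notify=lambda topic, n: sent.append((topic, n)))
for topic in ["brand-x", "brand-y", "brand-x", "brand-x", "brand-x"]:
    alert.record(topic)
print(sent)
```

Tracking which topics have already alerted keeps subscribers from being notified again on every mention after the threshold is crossed.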

System design

A public opinion system first needs a crawler engine to collect content from mainstream portals, shopping sites, community forums, Weibo, and WeChat Moments. The collected pages and messages (tens of billions of them) must be stored in real time. Before fetching a page by URL, the crawler should check whether the page has already been crawled, to avoid unnecessary repeat crawls. After a page is collected, it must be parsed: unnecessary tags are removed and the title, summary, body, comments, and so on are extracted. The extracted content goes into the storage system for later queries. At the same time, newly extracted results are pushed to the computing platform for statistical analysis, report generation, and later public opinion retrieval. Depending on the algorithm, a computation may need only the new data or the full data set. Because public opinion is time-sensitive, the system must process new content efficiently; ideally, new hot topics become searchable within seconds.
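
As a rough illustration of the extraction step, the sketch below strips tags from a raw page and keeps the title and visible text, using only Python's standard `html.parser`. The `PageExtractor` class, the `extract` helper, and the sample page are assumptions made for this example; production extractors handle authors, timestamps, comments, and malformed markup far more carefully.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect the <title> text and visible text, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        chunk = data.strip()
        if not chunk or self._skip_depth:
            return
        if self._in_title:
            self.title += chunk
        else:
            self.text.append(chunk)


def extract(html):
    """Return a small structured record for one raw page."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "body": " ".join(parser.text)}


page = (
    "<html><head><title>Launch review</title><style>p{color:red}</style></head>"
    "<body><h1>New phone</h1><p>Battery life is great.</p>"
    "<script>track()</script></body></html>"
)
record = extract(page)
print(record)
```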

We can summarize the entire data flow as follows:

[Figure: data flow of the public opinion analysis system]

As the figure shows, a storage and analysis platform for network-wide public opinion has to handle crawling, storage, analysis, search, and display. Specifically, we need to solve the following problems:

How to store tens of billions of raw web pages efficiently. To make the analysis more comprehensive and accurate, we want to crawl as many pages as possible and then aggregate them according to weights we set. The historical page library therefore grows very large: tens of billions of pages, amounting to hundreds of terabytes or even several petabytes. At that scale we still need millisecond-level read and write latency, which traditional databases struggle to deliver.
How to tell, before crawling a page, whether it has already been crawled. For ordinary pages, public opinion analysis cares about timeliness, so we may want to crawl each page only once. We can deduplicate by page URL to avoid wasting resources on repeat crawls, which requires distributed storage that offers efficient random lookups by URL.
How to run structured extraction in real time after a new raw page is stored, and how to store the results. A raw page contains all kinds of HTML tags; we need to strip them and extract the article's title, author, publication time, and so on. This provides the structured data needed for later sentiment analysis.
How to connect to the computing platform efficiently and stream newly extracted structured data into real-time computation. Here we classify page content and message text, recognize sentiment, and run statistical analysis on the results. Full-scale analysis is too slow, and public opinion usually centers on the latest news and comments, so the analysis must be incremental.
How to provide efficient public opinion search. Besides subscribing to fixed keywords, users also run ad hoc keyword searches, for example to see public opinion about a competitor's new product.
How to push new public opinion in real time. To keep results timely, we must not only persist the analysis results but also push them out. The pushed content is usually the new public opinion identified by real-time analysis.

System architecture

To address these problems, we describe how to build a network-wide public opinion analysis platform for tens of billions of pages on Alibaba Cloud products. We focus on choosing the storage product and on connecting it efficiently to the various computing and search platforms.

[Figure: architecture of the public opinion analysis platform on Alibaba Cloud]

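For the crawler engine we use ECS. The number of ECS instances can be sized to the crawl volume, and capacity can be added temporarily during daily peaks. After a raw page is crawled, its URL and content are written to the storage system. To avoid repeat crawls, the crawler engine deduplicates against the URL list before fetching. The storage engine must support low-latency random lookups to determine whether the current URL already exists; if it does, the page need not be fetched again.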
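
The check-before-crawl step can be sketched as a point lookup on a key derived from the URL. Everything here is a stand-in chosen for illustration: `PageStore` is an in-memory dictionary rather than a distributed table, and hashing the URL with MD5 into a fixed-length key is an assumption, not a documented requirement of any product.

```python
import hashlib
from urllib.parse import urldefrag


def url_key(url):
    """Drop the #fragment and hash the URL into a fixed-length row key."""
    clean = urldefrag(url.strip()).url
    return hashlib.md5(clean.encode("utf-8")).hexdigest()


class PageStore:
    """In-memory stand-in for a key-value table keyed by URL hash."""

    def __init__(self):
        self._rows = {}

    def exists(self, key):
        return key in self._rows

    def put(self, key, row):
        self._rows[key] = row


def crawl_once(store, url, fetch):
    """Fetch and store the page only if its key is not in the store yet."""
    key = url_key(url)
    if store.exists(key):
        return False
    store.put(key, {"url": url, "content": fetch(url)})
    return True


fetched = []


def fake_fetch(url):
    # Hypothetical fetcher that only records which URLs were requested.
    fetched.append(url)
    return "<html>...</html>"


store = PageStore()
urls = ["https://example.com/a", "https://example.com/a#comments", "https://example.com/b"]
results = [crawl_once(store, u, fake_fetch) for u in urls]
print(results, fetched)
```

With several crawler workers, the separate `exists` and `put` calls would race; a real deployment would want an atomic "write only if absent" operation from the store.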

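To extract raw page content in real time, new pages must be pushed to the computing platform. Earlier architectures often relied on dual writes at the application layer: when raw page data is written to the database, a second copy is written to the computing platform. That forces us to maintain two sets of write logic. The same problem appears when structured increments flow into the public opinion analysis platform: the extracted structured metadata also has to be dual-written. The analysis results, too, need one copy in distributed storage and one pushed to the search platform. So the three red lines in the figure imply dual writes for three data sources. This increases development effort and makes the system harder to implement and maintain. Each dual-writing data source must either be aware of its downstream consumers or use a message service and dual-write messages to decouple them. Traditional databases such as MySQL let clients subscribe to the incremental log (binlog). If a distributed storage product can support high access and storage volumes and also offer incremental subscription, the architecture becomes much simpler.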
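
The difference between dual writes and incremental subscription can be shown with a toy table that keeps an append-only change log. This is a sketch of the idea only: `LoggedTable`, `read_changes`, and the integer checkpoints are invented for the example and are not the interface of MySQL binlog or of any cloud product.

```python
class LoggedTable:
    """Toy table whose every write is also appended to a change log."""

    def __init__(self):
        self.rows = {}
        self.log = []  # append-only list of (key, row) changes

    def put(self, key, row):
        # The writer performs a single write; it knows nothing about consumers.
        self.rows[key] = row
        self.log.append((key, row))

    def read_changes(self, checkpoint):
        """Return the changes after `checkpoint` and the next checkpoint."""
        changes = self.log[checkpoint:]
        return changes, checkpoint + len(changes)


table = LoggedTable()
table.put("url-1", {"title": "A"})
table.put("url-2", {"title": "B"})

# Two downstream consumers, each with its own checkpoint.
extract_cp, search_cp = 0, 0

batch, extract_cp = table.read_changes(extract_cp)
extracted = [key for key, _ in batch]

table.put("url-3", {"title": "C"})

batch, extract_cp = table.read_changes(extract_cp)
extracted += [key for key, _ in batch]

batch, search_cp = table.read_changes(search_cp)
indexed = [key for key, _ in batch]
print(extracted, indexed)
```

Because each consumer tracks its own position in the log, adding a new downstream system does not require touching the write path.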

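After page data is collected and stored, increments flow into the computing platform for real-time metadata extraction. Here we can use Function Compute: when a new page needs extraction, it triggers a managed function that extracts the page metadata. The extracted results are persisted in the storage system and pushed to MaxCompute for public opinion analysis such as sentiment analysis and text clustering. This may produce report data, user sentiment statistics, and similar results. The results are written to the storage system and the search engine, and some reports and threshold alerts are pushed to subscribers. The search engine's data serves the online public opinion retrieval system.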
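
To make the incremental style of analysis concrete, the sketch below updates sentiment counters from each batch of newly extracted records instead of rescanning all history. The keyword lexicon is a deliberately crude stand-in for a real sentiment model, and `classify`, `SentimentStats`, and the sample records are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy lexicon; a real system would use a trained classifier.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"bad", "broken", "refund"}


def classify(text):
    """Label text by counting lexicon hits among its whitespace-separated words."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


class SentimentStats:
    """Running per-topic sentiment counts, updated only with new records."""

    def __init__(self):
        self.counts = Counter()

    def update(self, new_records):
        for record in new_records:
            self.counts[(record["topic"], classify(record["text"]))] += 1


stats = SentimentStats()
stats.update([
    {"topic": "phone", "text": "battery is great"},
    {"topic": "phone", "text": "screen arrived broken"},
])
stats.update([
    {"topic": "phone", "text": "love the camera"},
    {"topic": "phone", "text": "it is a phone"},
])
print(dict(stats.counts))
```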

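With the overall architecture introduced, let's look at how to choose storage on Alibaba Cloud.

Storage selection

From the architecture above, we can summarize the requirements for storage:

Support massive data storage (TB to PB scale) and highly concurrent access (from one hundred thousand to ten million TPS) with low latency.
Crawl volume changes as the business adjusts the page sources it subscribes to, and within a single day the number of pages crawled shows clear peaks and troughs. The database must therefore scale out and in elastically.
A flexible table attribute structure. The attributes we care about may differ considerably between ordinary web pages and social platform pages, and a flexible schema makes extension easier.
Automatic expiration or tiered storage for old data, because public opinion analysis tends to focus on recent hot topics and old data is accessed infrequently.
A good incremental channel for exporting newly added data to the computing platform on a regular basis. The three red dashed lines in the figure above share one trait: each needs to deliver increments to the corresponding computing platform in real time, with the computed results written back to the corresponding storage engine. If the database engine supports incremental reads natively, the architecture is greatly simplified: there is no need to scan the full data set to filter out increments, or to dual-write on the client to obtain them.
A good search solution, either built in or through seamless integration with a search engine.

Given these requirements, we need a distributed NoSQL database to handle storage of and access to massive data. The need for incremental data access at several stages, together with the peaks and troughs in business traffic, points to Table Store, with its elastic billing, as the best choice for this architecture. For an introduction to its architecture, see the Table Store data model documentation.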

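Compared with similar databases, a major functional advantage of TableStore (Table Store) is its fairly complete incremental interface, the Stream incremental API. For an introduction, see the Table Store Stream overview; for scenarios, see the Stream use case introduction; for API usage, see the Java SDK Stream documentation. With the Stream interface we can easily subscribe to all modifications made to TableStore, that is, all newly added data. We have also built many data channels on top of Stream that connect to downstream computing products. Users do not even need to call the Stream API directly; they can subscribe to incremental data downstream through these channels and plug naturally into Alibaba Cloud's computing ecosystem. TableStore already supports Function Compute, MaxCompute, Elasticsearch, and DataV, all mentioned in the architecture above. For details, see:

Stream and Function Compute integration
Stream and MaxCompute
Stream and Elasticsearch
Displaying Table Store data with DataV

TableStore has a schema-free structure for attribute columns. In public opinion analysis, we may add attribute fields as the analysis algorithms are upgraded, and the attributes of ordinary web pages may differ from those of social pages such as Weibo posts. A schema-free table therefore fits this requirement much better than a traditional database.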
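
The value of schema-free attribute columns is easy to see in a small model where each row is a primary key plus an arbitrary set of named attributes. The sketch uses a plain dictionary and a hypothetical `put_row` helper; it models the idea only and does not call the Table Store SDK. The URLs and column names are made up.

```python
def put_row(table, url, **attributes):
    """Upsert attribute columns for the row keyed by `url`; no schema is declared."""
    table.setdefault(url, {}).update(attributes)


table = {}

# An ordinary news page and a social post carry different attributes.
put_row(table, "https://news.example.com/1",
        title="Launch", author="Li", publish_time="2018-05-01")
put_row(table, "https://weibo.example.com/2",
        text="Nice phone", reposts=1200, likes=3400)

# A later algorithm upgrade adds a new column to one row without any migration.
put_row(table, "https://news.example.com/1", sentiment="positive")

columns = sorted({name for row in table.values() for name in row})
print(columns)
```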

The architecture calls for three repositories: the raw page library, the structured metadata library, and the public opinion results library. The first two are generally offline storage and analysis libraries, while the last is an online database. They differ in their requirements for access performance and storage cost. Table Store offers two instance types that support this kind of tiering: high-performance and capacity. High-performance instances suit workloads with many writes and many reads, that is, online business storage. Capacity instances suit workloads with many writes and few reads, that is, offline business storage. Single-row write latency can be kept within 10 milliseconds on both, and reads on high-performance instances stay at the millisecond level. TableStore also supports TTL, a table-level data expiration time. We can set a TTL on the public opinion results as needed so that only recent data is queryable and older results expire and are deleted automatically.

With these features, TableStore (Table Store) satisfies all six storage requirements listed above, and the network-wide public opinion storage and analysis system can be designed and built on it.
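
Table-level expiration can be modeled as follows. This is a simplified sketch: `TtlTable` is an in-memory stand-in, rows are expired lazily on read, and the explicit `now` arguments exist only to make the example deterministic. The managed service's actual expiry mechanics may differ.

```python
import time


class TtlTable:
    """In-memory table whose rows stop being readable after `ttl_seconds`."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._rows = {}  # key -> (written_at, row)

    def put(self, key, row, now=None):
        written_at = time.time() if now is None else now
        self._rows[key] = (written_at, row)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._rows.get(key)
        if entry is None:
            return None
        written_at, row = entry
        if now - written_at > self.ttl:
            # Expired: drop the row so it no longer takes up space.
            del self._rows[key]
            return None
        return row


DAY = 24 * 3600
results = TtlTable(ttl_seconds=7 * DAY)
results.put("topic-1", {"mentions": 42}, now=0)

fresh = results.get("topic-1", now=3 * DAY)
stale = results.get("topic-1", now=8 * DAY)
print(fresh, stale)
```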

Postscript

This article summarized the storage and analysis problems that arise in public opinion analysis over massive data, and showed how Alibaba Cloud's self-developed TableStore (Table Store) can meet the business's basic data volume requirements while simplifying the architecture by connecting its Stream interface to the computing platforms. TableStore is a professional-grade distributed NoSQL database developed in-house by Alibaba Cloud: a high-performance, low-cost, easily scalable, fully managed semi-structured data storage platform built on shared storage. Data processing is one of its important application areas. For other scenarios, see the TableStore Advanced Road series.

Author: Yu Heng

Origin http://43.154.161.224:23101/article/api/json?id=326215752&siteId=291194637