Lindorm practice | Internet advertising solutions based on Lindorm


User benefits

Alibaba Cloud has recently released Lindorm, the industry's first cloud-native multi-model database. New users can apply for a one-month free trial and get product technical support by joining the DingTalk group 35977898. For more content, please refer to the link

1. Background

Advertising, as the name suggests, means promoting goods or services to a broad audience. Before the Internet age, advertising relied on newspapers, magazines, television, and other traditional mass media, and ad sales were handled offline. With the arrival of the Internet and the mobile Internet, portal websites and mobile apps transformed how advertising is sold. This drove the steady development of programmatic ad buying, and the related online advertising systems have kept evolving as well.

2. Programmatic buying process and business characteristics of advertising

2.1 The purchase process of an advertisement

Programmatic ad buying uses emerging Internet technology platforms, such as the ad exchange (ADX), the supply-side platform (SSP), the demand-side platform (DSP), and the data management platform (DMP), to connect the needs of consumers, advertisers, media, and other parties.
Ad inventory is managed in the SSP. When a consumer reaches an ad slot, the information is passed to the ADX, and the consumer's profile data is looked up in the DMP to determine what the ad slot should display. The ADX then sends bid invitations to each DSP to find suitable advertisers, the bidding stage settles the ad transaction, and finally the product the winning advertiser wants to promote is delivered on the media side and shown to the consumer.
image
Note: This picture is from the Internet. If there is any infringement, please contact the author to delete it.

2.2 Analysis of Advertising Purchase Process

From the programmatic buying process above, we can see that:

  • The ad buying chain involves many steps and is very long, yet its real-time requirements are strict: the whole chain usually has to finish within about 100 milliseconds. Otherwise the user experience suffers, and a timed-out request hurts every party: the media's ad slot goes unsold, the advertiser's product is not shown to the end user, and the DSP, SSP, DMP, and other parties earn no commission because no transaction was concluded.

  • To achieve "precision marketing" and personalized ads ("a thousand faces for a thousand people"), user profile data is vital: it decides which ad content is shown to a specific user. Without it you end up with the proverbial joke of trying to sell refrigerators to Eskimos.
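
The 100-millisecond budget in the first point can be illustrated with a minimal Python sketch. It assumes a simplified exchange that fans bid requests out to several DSPs in parallel and simply drops any bid that misses the deadline. The DSP names, delays, prices, and the "highest bid wins" rule are all hypothetical; real exchanges apply their own auction rules.

```python
from __future__ import annotations

import asyncio


async def request_bid(name: str, delay_s: float, price: float) -> tuple[str, float]:
    """Hypothetical DSP call: the sleep stands in for network and decision time."""
    await asyncio.sleep(delay_s)
    return name, price


async def run_auction(dsps, budget_s: float = 0.1):
    """Fan out bid requests and keep only the bids that arrive within budget_s."""
    tasks = [asyncio.create_task(request_bid(*d)) for d in dsps]
    done, pending = await asyncio.wait(tasks, timeout=budget_s)
    for task in pending:
        task.cancel()  # late bids are dropped; the ad slot cannot wait for them
    bids = [task.result() for task in done]
    # Assumption: the highest bid wins.
    return max(bids, key=lambda bid: bid[1]) if bids else None


def demo():
    # dsp_c offers the highest price but answers after 500 ms, so it is ignored.
    dsps = [("dsp_a", 0.01, 1.20), ("dsp_b", 0.02, 1.50), ("dsp_c", 0.50, 9.99)]
    return asyncio.run(run_auction(dsps))
```

In this sketch the slow DSP loses the impression even though it bid the most, which is the "detrimental to all parties" effect of a timeout described above.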

So what kinds of data are involved here?

  • User profile data, that is, basic attributes of the user such as gender, age, address, and income level

  • Behavior data from the advertiser's website or store: browsing (including dwell time), favorites, add-to-cart actions, transactions, and so on

  • Other user behavior data, such as channel browsing, business page browsing, comments, ratings, and community and forum activity

  • Scene-specific data for contextual advertising, such as the user's real-time location and the time

  • Data on users' clicks on and interactions with ads

  • Third-party offline data. Offline behavior usually costs a person more effort than online behavior, so it tends to signal stronger intent. For example, visiting a 4S car dealership often indicates an intention to buy a car.

2.3 Data storage characteristics of advertising business systems

  • Massive data at low cost: A typical feature of Internet applications is a large user base, often tens of millions or even billions of users, so the behavior data (browsing, add-to-cart, favorites, and so on), scene data, and ad click and interaction data are massive as well. This detailed data is fed to offline model training to produce the final user profile data, which amounts to hundreds of millions of high-dimensional rows (hundreds, thousands, or even tens of thousands of fields).

  • High-concurrency reads and writes with low latency: A large user base generates a large amount of data that must be written to the back-end storage system in real time, so write concurrency often reaches tens of thousands, hundreds of thousands, or even millions of writes per second, or higher. Unlike transactions, ad requests scale with users' browsing behavior, so reads are even more frequent, and in an advertising system with a very long chain each response must still return within 100 milliseconds.

  • Archiving capability: Behavior details and other data written to back-end storage usually need to be archived to the offline system in near real time, so that analysis can be completed and its results fed back into the user profile data as soon as possible.

  • Efficient, low-impact data reflow: The data archived to the offline system is analyzed to generate new profile data, which must be loaded back into online storage efficiently and without affecting online queries.

  • Dynamic schema: As mentioned above, the advertising system relies on many data sources. For example, the set of behavior data collected keeps changing, so the table structure keeps changing too.

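Given these storage requirements, and given that this kind of data has no strong transactional requirements, is there a suitable storage solution?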

3. Lindorm for big data scenarios

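No strong transactional requirements, massive data volumes, high concurrency and low latency, near-real-time archiving, efficient data reflow, and dynamic schema: these are exactly the problems that Alibaba Cloud's self-developed NoSQL database products set out to solve.
As a semi-structured and structured storage system for big data scenarios, Lindorm has been developed at Alibaba for nearly ten years, has kept up a fast pace of capability updates and technology upgrades, and is now one of the core database products supporting the businesses of the Alibaba economy. Over those years, driven by internal demand for storing and processing massive structured data, its many innovations in functionality, performance, and stability have been tested by long-running, large-scale production use. It is widely used across Alibaba Group, Ant Group, Cainiao, Alibaba Digital Media and Entertainment, and other business units, and is currently the database product with the largest data volume and the broadest business coverage inside Alibaba.
The user profile architecture based on Lindorm storage is shown in the following figure:
image
The following sections describe in detail which Lindorm features meet the storage requirements of a programmatic ad buying and delivery system built on user profile data.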

3.1 Low cost

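Big data is commonly characterized by the five Vs, and the first of them is Volume, so a storage solution for big data scenarios must offer high density and low cost. Lindorm is a NoSQL database born in the big data era, and storing and retrieving massive data efficiently at low cost is in its DNA. Lindorm's low cost shows up in the following ways:
- Support for multiple storage types
Performance storage, standard storage, and capacity storage: there is always one that fits your business scenario.
- Deep compression optimization
The cheapest storage system is one with no data to store, but that is obviously unrealistic; the practical approach is to reduce the data that must be stored to a reasonable minimum. To cut storage overhead, Lindorm introduced a new lossless compression algorithm designed to provide fast compression with a high compression ratio. It neither pursues the highest possible compression ratio as LZMA and ZPAQ do, nor the extreme compression speed of LZ4. The algorithm compresses at more than 200 MB/s and decompresses at more than 400 MB/s (lab data), which meets Lindorm's throughput needs well. Validated in real scenarios, the new compression optimization improves the compression ratio very significantly over LZO, with a reported gain of 50% to 100%. For storage-heavy workloads this means a cost reduction of up to 50%, since doubling the compression ratio halves the stored bytes.
- Hot and cold data separation
Lindorm can separate hot and cold data within a single table under one storage architecture. The system automatically archives cold data in the table to cold storage according to the hot/cold boundary set by the user. Access is almost identical to that of an ordinary table: at query time the user only needs to supply a query hint or a time range, and the system decides from those conditions whether the query should go to the hot data area or the cold data area. To the user it is always one table, and the separation is almost completely transparent.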

image
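
The hot/cold routing idea can be sketched in a few lines of Python. This is a toy model, not Lindorm's API: the class name, the fixed numeric boundary, and the rule "rows older than the boundary are cold" are assumptions made only to show how a time range can decide which tier a query touches.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TieredTable:
    """Toy model of one logical table whose rows live in a hot or a cold tier."""

    boundary_ts: int  # assumed rule: rows with ts < boundary_ts are cold

    def tiers_for_range(self, start_ts: int, end_ts: int) -> list[str]:
        """Return the tiers a query over [start_ts, end_ts) has to read."""
        if start_ts >= end_ts:
            return []  # empty range, nothing to read
        tiers = []
        if end_ts > self.boundary_ts:
            tiers.append("hot")   # the range reaches rows at or after the boundary
        if start_ts < self.boundary_ts:
            tiers.append("cold")  # the range reaches rows before the boundary
        return tiers


table = TieredTable(boundary_ts=1000)
recent_only = table.tiers_for_range(1200, 1500)   # touches the hot tier only
history_only = table.tiers_for_range(100, 500)    # touches the cold tier only
both = table.tiers_for_range(900, 1100)           # straddles the boundary
```

The caller always addresses one table; only the time range changes which tier is read, which is the transparency property described above.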

3.2 High throughput

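In tests with the same instance specifications and the same data volume, Lindorm's throughput and P99 latency are several times better than those of community HBase 2.0, for single-row reads, range reads, single-row writes, and batch writes alike.
image
Notes: 1) P99 latency is the value below which 99% of request response times fall; 2) the figures in the chart are for reference only, and actual results depend on the scenario.
The figure below shows the results after migrating a real workload dominated by batch writes. Behavior log collection for user profiles can often be handled the same way, by accumulating a certain amount of data and then writing it in batches.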

image

3.3 Multi-AZ + Speculative access

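Lindorm offers different cross-availability-zone modes, strong consistency or eventual consistency, to meet the high-availability and performance requirements of different business scenarios. Advertising scenarios built on user profile data do not require strict data consistency; eventual consistency is enough. On that premise, the Speculative access mode provided by Lindorm can greatly reduce the latency spikes caused by single-node or cluster anomalies, and so meet the advertising scenario's extremely demanding response-time requirements.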

image
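
The general technique behind speculative access is often called a hedged read: ask one replica first, and if it has not answered within a short threshold, ask a second replica and take whichever answer arrives first. The Python sketch below illustrates that pattern only. The replica names, delays, and the 50 ms threshold are hypothetical, and it does not describe how Lindorm implements the feature internally.

```python
from __future__ import annotations

import asyncio


async def read_replica(name: str, delay_s: float, value: str) -> tuple[str, str]:
    """Hypothetical replica read: the sleep stands in for the storage round trip."""
    await asyncio.sleep(delay_s)
    return name, value


async def speculative_read(primary, backup, hedge_after_s: float = 0.05):
    """Ask the primary first; if it is slow, also ask the backup and take the first answer."""
    first = asyncio.create_task(read_replica(*primary))
    done, _ = await asyncio.wait({first}, timeout=hedge_after_s)
    if done:
        return first.result()  # the primary answered in time, no second request sent
    second = asyncio.create_task(read_replica(*backup))
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # the slower request is abandoned
    return next(iter(done)).result()


def demo_fast_primary():
    return asyncio.run(speculative_read(("az1", 0.0, "profile"), ("az2", 0.01, "profile")))


def demo_slow_primary():
    # The primary stalls for 500 ms, so the answer comes from the backup zone.
    return asyncio.run(speculative_read(("az1", 0.5, "profile"), ("az2", 0.01, "profile")))
```

Because the backup may lag slightly behind the primary, this pattern fits the eventual-consistency mode described above rather than the strongly consistent one.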

3.4 Real-time incremental archiving

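Real-time incremental archiving is an independent Lindorm service: LTS listens to the logs that Lindorm produces, parses them, and synchronizes the data to an offline system such as Hadoop or MaxCompute. The data synchronized to the offline system is partitioned by time, which makes T+1, H+1, or other periodic computations convenient.
image
With this synchronization mechanism, the archiving process is decoupled from online storage, so online reads and writes are not affected by archiving at all. In addition, detailed data can be synchronized to the offline system in near real time and then analyzed, so user profile data can be updated efficiently.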
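
Time partitioning of archived data can be sketched as follows. This is an illustration in Python under assumed conventions: the log entry layout `(ts_ms, row_key, payload)` and the partition naming `dt=YYYYMMDD/hh=HH` are hypothetical and are not taken from LTS.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_partition(ts_ms: int) -> str:
    """Map an event timestamp (milliseconds since epoch, UTC) to an hourly partition."""
    moment = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return moment.strftime("dt=%Y%m%d/hh=%H")


def archive(log_entries):
    """Group parsed log entries by hourly partition, as an offline sink might."""
    partitions = defaultdict(list)
    for ts_ms, row_key, payload in log_entries:
        partitions[hour_partition(ts_ms)].append((row_key, payload))
    return dict(partitions)


entries = [
    (1_600_000_000_000, "user1", "click"),   # 2020-09-13 12:26:40 UTC
    (1_600_000_100_000, "user2", "view"),    # same hour
    (1_600_003_600_000, "user1", "cart"),    # one hour later
]
archived = archive(entries)
```

With data laid out this way, an H+1 job reads only the partition of the previous hour and a T+1 job reads the 24 hourly partitions of the previous day.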

3.5 Bulkload technology

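Unlike relational databases, Lindorm uses an LSM-tree architecture. Reading a record stored in Lindorm requires merging the data in the memory of the corresponding data shard (the memstore) with the latest version of that record in the multiple LDFiles owned by the shard, and then returning the merged result to the client. Based on this principle, Lindorm can generate new LDFiles directly and "insert" them into the system to load "new" data, which gives it a large advantage over relational databases and other NoSQL systems. This loading path completely bypasses the storage engine, the WAL, the memstore, and so on, leaving only the unavoidable physical I/O and network overhead, which greatly improves data loading performance and reduces the impact on online requests.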
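
The read-merge and bulk-load ideas above can be illustrated with a toy LSM shard in Python. The class and method names are hypothetical, files are modeled as plain dictionaries, and nothing here reflects Lindorm's real file format or API; the sketch only shows why attaching a pre-built file makes new data visible without going through the write path.

```python
class ToyLsmShard:
    """Toy LSM shard: one in-memory table plus a list of immutable files."""

    def __init__(self):
        self.memstore = {}  # key -> (version, value)
        self.files = []     # each file is a dict: key -> (version, value)

    def put(self, key, version, value):
        """Normal write path; a real engine would append to a WAL first."""
        current = self.memstore.get(key)
        if current is None or version >= current[0]:
            self.memstore[key] = (version, value)

    def bulk_load(self, file_rows):
        """Attach a pre-built file directly, bypassing the WAL and the memstore."""
        self.files.append(dict(file_rows))

    def get(self, key):
        """Read path: merge the memstore and every file, return the newest version."""
        sources = [self.memstore, *self.files]
        candidates = [source[key] for source in sources if key in source]
        if not candidates:
            return None
        return max(candidates, key=lambda cell: cell[0])[1]


shard = ToyLsmShard()
shard.put("user1", 1, "old profile")
# Profile data produced offline is loaded as a whole file.
shard.bulk_load({"user1": (2, "new profile"), "user2": (1, "first profile")})
```

After the bulk load, a read of `user1` returns the newer version from the loaded file, while the memstore itself was never touched.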

3.6 Dynamic columns

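Lindorm's wide table model supports multiple column families, dynamic columns, TTL, multiple versions, and other features, which suits business scenarios such as user profiles, where the table structure is unstable and changes often.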
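
A toy Python model can show how dynamic columns, TTL, and multiple versions interact in a wide-table row. Everything in it is an assumption for illustration (class name, column names, integer timestamps in seconds, a default of two retained versions); it is not Lindorm's data model or client API.

```python
class ToyWideRow:
    """Toy wide-table row: a column exists as soon as it is first written."""

    def __init__(self, ttl_s: int, max_versions: int = 2):
        self.ttl_s = ttl_s
        self.max_versions = max_versions
        self.cells = {}  # column name -> list of (ts, value), newest first

    def put(self, column: str, value, ts: int) -> None:
        versions = self.cells.setdefault(column, [])  # no schema change needed
        versions.append((ts, value))
        versions.sort(key=lambda cell: cell[0], reverse=True)
        del versions[self.max_versions:]  # keep only the newest versions

    def get(self, column: str, now: int):
        """Return the newest value that has not outlived the TTL, else None."""
        for ts, value in self.cells.get(column, []):
            if now - ts <= self.ttl_s:
                return value
        return None


row = ToyWideRow(ttl_s=100)
row.put("age_band", "25-34", ts=0)
row.put("last_click_category", "cars", ts=50)    # a new column, added on the fly
row.put("last_click_category", "phones", ts=60)  # a second version of the same cell
```

A new behavior signal becomes a new column on first write, and stale signals age out through the TTL instead of requiring a cleanup job.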

4. Overview of Lindorm's core capabilities

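With its comprehensive, multi-faceted capabilities, Lindorm can meet the needs of a user profile business well: large data volumes, high concurrency, real-time archiving, efficient and stable bulk data loading, dynamic columns, and multi-dimensional complex queries.
Of course, Lindorm's capabilities go well beyond this. It has the set of capabilities that a storage system for massive data should have in the big data era:

  • A multi-model database that supports wide tables, time series, search, and files

  • A database built on an architecture that separates storage from compute, offering highly elastic scaling of both compute and storage; a new Serverless service will add on-demand instant elasticity and pay-per-use billing

  • A highly cost-effective database that supports hot and cold data separation and pursues better compression optimization

  • A database with global secondary indexes, multi-dimensional retrieval, time series indexes, and other features

  • Provides LDInsight, a tool with intelligent service capabilities, for handling system management, data access, and fault diagnosis through a graphical console

  • Provides LTS (Lindorm Tunnel Service, formerly BDS), which supports easy-to-use data exchange, processing, and subscription, meeting needs such as data migration, real-time subscription, data lake dump, data warehouse reflow, unitized multi-active deployment, and backup and recovery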

5. Case study

A global media investment management group

The company manages more than 100 billion U.S. dollars in media investment every year. It has many media experts with a deep understanding of consumers and media platforms, professional media buying capabilities, market-leading brand safety measures, and technology solutions, and it helps its stakeholders build market advantages in the global market.
By migrating the in-memory database Couchbase that its advertising system relied on to Lindorm, the company significantly reduced operating costs. Using the Speculative access mode provided by Lindorm, it keeps P99 latency spikes to a maximum of 5 ms, which meets the advertising system's extreme response-time requirements, and the architecture also gains multi-availability-zone disaster recovery.
The system architecture and business process are shown in the following figure:

image




Origin blog.51cto.com/15060465/2675047