TOP100summit: [Case Share - WalmartLabs] Using Open Source Big Data Technology to Build the WMX Advertising Benefit Analysis Platform

This article is based on the case shared at the 2016 TOP100summit by Su Difu, chief engineer and architect of the WalmartLabs advertising platform.

Editor: Cynthia

 

Su Difu: Chief Engineer and Architect of WalmartLabs Advertising Platform

He has extensive experience in big data platform architecture design, message middleware, distributed systems, and related fields.

As a technical lead, he has helped a number of companies build big data platforms and distributed systems.

He currently leads the development of the WMX big data platform, the advertising benefit analysis system, and the real-time data pipeline.

 

Introduction: As the world's largest commodity retailer, Walmart places a large number of advertisements, handles a huge volume of transactions, and generates massive amounts of data every day. Analyzing this data produces strategies that help advertisers place advertisements effectively and promote product sales. Using the concrete case of measuring the effectiveness of Facebook advertising, this article describes how the Walmart WMX team used open source technology to build the WMX advertising benefit analysis platform, support rapid algorithm iteration, and continuously adopt new big data technology to improve system performance and operating efficiency, software quality, and the team's knowledge level.

 

1. Problem Statement

Walmart has numerous retail stores and online sales channels. When a customer purchases an item, the transaction and customer information are recorded. The customer information is sorted and classified to form Walmart's user profiles, covering attributes such as address, gender, age, credit card, education, marital status, hobbies, and consumption habits. Through user analysis we can link a user's real identity with their online identity, and also derive useful groupings such as individual users and household users.

To help suppliers promote their products, Walmart runs advertising campaigns. For a given set of products, a campaign selects suitable users as the audience based on their background information, creates advertisements, and places them on selected channels such as mobile apps, email, the Walmart website, social media, search engines, and news websites.

The data involved in an advertising campaign is therefore high-dimensional, for example:

● User data: address, income, expenses, education, marriage, gender, age

● Ad data: graphic format, size, clickability, location

● Product data: product type, attributes, promotions, discounts

● Placement channel data: URL, website, app

● Display data: time, device, location

When generating benefit analysis reports and measuring the effectiveness of advertising campaigns, we need to select audiences and aggregate sales revenue information by any combination of dimensions. 

A common method for measuring the effectiveness of an advertising campaign is A/B testing: users are divided into groups A and B, where group A is the campaign audience and group B is not. Comparing the transaction amounts of groups A and B gives the effect of the campaign.
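As a minimal illustration of this comparison (not the team's actual implementation; the class and field names are hypothetical), the effect can be expressed as the lift of group A's average spend over group B's:

    import java.util.List;

    public class AbTestLift {

        // Hypothetical per-user attributed spend record.
        public static class UserSpend {
            final String userId;
            final String bucket;   // "A" = campaign audience, "B" = control group
            final double amount;

            UserSpend(String userId, String bucket, double amount) {
                this.userId = userId;
                this.bucket = bucket;
                this.amount = amount;
            }
        }

        // Average spend per user within one bucket.
        static double averageSpend(List<UserSpend> spends, String bucket) {
            return spends.stream()
                    .filter(s -> s.bucket.equals(bucket))
                    .mapToDouble(s -> s.amount)
                    .average()
                    .orElse(0.0);
        }

        // Campaign effect (lift): relative increase of group A spend over group B spend.
        static double lift(List<UserSpend> spends) {
            double a = averageSpend(spends, "A");
            double b = averageSpend(spends, "B");
            return b == 0.0 ? 0.0 : (a - b) / b;
        }
    }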

There are three difficulties in generating the benefit analysis report.

● First, joining high-dimensional data produces very large intermediate results.

● Second, aggregating by arbitrary combinations of dimensions would require generating an enormous number of reports.

● Third, the algorithm that matches transactions to advertising campaigns is not unique; algorithm testing and iteration are required to select the best one.

The original system relied mainly on HiveQL to join large amounts of data, filter it, and generate a separate report for each aggregation; every report reran the entire process. This approach was complex and inefficient, could not reuse intermediate results, and was hard to improve.

This is why we needed to develop an advertising benefit analysis platform that overcomes these shortcomings: one that can efficiently generate benefit analysis reports aggregated by any combination of dimensions and supports rapid algorithm iteration.

 

2. System Architecture and Technology Evolution

This section explains the design and technical evolution of the system architecture based on the Facebook advertising case.

  

Figure 1 shows the relationships among the data of Walmart's advertising campaigns on Facebook. A campaign (Campaign) starts by creating an advertisement (Creative) for a product category (Item Type) and selecting suitable Walmart individual users (Individual) as the audience (Audience) according to the campaign requirements. The audience is divided into two groups, A and B (Bucket), for A/B testing. The user information is first uploaded to Facebook; after Facebook displays the ads, Facebook's ad impression data (Facebook User) is downloaded.

Walmart individual users fall into two categories:

● A Customer completes transactions (Transaction) in the online store (Online);

● A Living Unit completes transactions in physical stores (Store).

Transactions and advertising campaigns are linked through user, product, and time.

We combine the ad impression data and the transaction data to generate the advertising benefit analysis report, which compares the total transaction amount between audience groups A and B. The data processing steps behind the report are as follows:

● User mapping: find the correspondence between Facebook User, Individual, Customer, Living Unit, Audience, and Bucket.

● Data join: join campaigns, audiences, users, products, and transactions to generate large data tables.

● Transaction attribution: analyze the relationship between campaigns and transactions, and attribute each transaction to a campaign (one illustrative algorithm is sketched after this list).

● Group aggregation: group the data as required and compute transaction amounts.
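The attribution step in particular can be implemented in several ways. As one illustrative possibility (not necessarily the algorithm used in WMX), the sketch below assigns a transaction to the most recent qualifying ad impression within a fixed window; all class and field names are hypothetical:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;
    import java.util.Optional;

    public class LastTouchAttribution {

        // Hypothetical ad impression: which campaign showed an ad to which user, and when.
        public static class Impression {
            final String userId;
            final String campaignId;
            final Instant shownAt;

            Impression(String userId, String campaignId, Instant shownAt) {
                this.userId = userId;
                this.campaignId = campaignId;
                this.shownAt = shownAt;
            }
        }

        // Attribute a purchase to the most recent impression for the same user
        // that happened within the attribution window before the purchase time.
        static Optional<String> attribute(String userId, Instant purchasedAt,
                                          List<Impression> impressions, Duration window) {
            return impressions.stream()
                    .filter(i -> i.userId.equals(userId))
                    .filter(i -> !i.shownAt.isAfter(purchasedAt))
                    .filter(i -> Duration.between(i.shownAt, purchasedAt).compareTo(window) <= 0)
                    .max((x, y) -> x.shownAt.compareTo(y.shownAt))
                    .map(i -> i.campaignId);
        }
    }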

The difficulties are:

● The joins that build the large data tables produce a huge volume of intermediate data, which must be filtered as early as possible.

● Transaction attribution is not unique: a transaction may be associated with multiple campaigns, so we need to test multiple algorithms to optimize the attribution.

● We need to group and aggregate by any combination of dimensions to produce high-dimensional reports; with many dimensions there are far too many grouping combinations to generate a report for each one.

In response to the above requirements and difficulties, we designed a new system architecture: a dynamically scalable modular data pipeline. Based on this new system architecture, the Facebook advertising benefit analysis system is shown in Figure 2.

 

The whole advertising system consists of three parts: the data collection system, the advertising benefit analysis system, and the report query system. Our focus here is the advertising benefit analysis system.

The advertising benefit analysis system reads data from the data collection system and generates the benefit analysis reports, which customers query through the report query system. We abstract the structure of Figure 2 into a platform: the dynamically scalable modular data pipeline shown in Figure 3.

The dynamically scalable modular data pipeline is formed by connecting multiple MapReduce modules. Similar to a microservice architecture, it decomposes a complex data processing job into multiple steps. Each module reads data from the database, processes it, and stores the results back in the database; the output of one module becomes the input of the next.
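A minimal sketch of this pipeline idea (not the actual WMX code; the module bodies and data locations are hypothetical): each module reads an input location and writes an output location that the next module consumes.

    import java.util.Arrays;
    import java.util.List;

    public class DataPipeline {

        // One step of the pipeline: reads inputPath, writes outputPath.
        public interface Module {
            void run(String inputPath, String outputPath) throws Exception;
        }

        // Run the modules in order, chaining each output into the next module's input.
        static void run(List<Module> modules, List<String> paths) throws Exception {
            for (int i = 0; i < modules.size(); i++) {
                modules.get(i).run(paths.get(i), paths.get(i + 1));
            }
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical storage locations for the intermediate results.
            List<String> paths = Arrays.asList(
                    "/wmx/raw_events", "/wmx/user_mapping", "/wmx/joined", "/wmx/attributed");

            // Placeholder modules; real modules would launch MapReduce or Spark jobs.
            List<Module> modules = Arrays.asList(
                    (in, out) -> System.out.println("user mapping: " + in + " -> " + out),
                    (in, out) -> System.out.println("data join:    " + in + " -> " + out),
                    (in, out) -> System.out.println("attribution:  " + in + " -> " + out));

            run(modules, paths);
        }
    }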

The structure of a MapReduce module is shown in Figure 4. It consists of three parts: a Job, multiple Mappers, and a Reducer (a code sketch follows the list below).

● The Job is responsible for configuring, composing, and controlling the module;

● The Mapper's Parser is responsible for reading the data;

● Strategies consist of multiple processing units; data passes through these units in turn to complete tasks such as filtering, enrichment, correction, and conversion;

● Auxiliary data is stored in the Distributed Cache;

● The Mapper's Collector serializes the processing results and passes them to the Reducer;

● Each Mapper processes one data source, so multiple data sources require multiple Mappers;

● The Reducer's Strategies, like the Mapper's, consist of multiple processing units that perform tasks such as joins, aggregation, and calculation;

● The Reducer's Collector stores the processing results in the database.
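A minimal sketch of this Mapper structure, assuming Hadoop's Java MapReduce API; the Strategy interface and the concrete strategies shown are hypothetical simplifications of the real processing units:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // A data processing unit; a module chains several of these (filter, add, correct, convert).
    interface Strategy {
        // Returns the processed record, or null to drop the record (filtering).
        String[] apply(String[] fields);
    }

    public class TransactionMapper extends Mapper<LongWritable, Text, Text, Text> {

        // Hypothetical Strategies: drop malformed rows, then normalize the user id.
        private final List<Strategy> strategies = Arrays.asList(
                fields -> fields.length == 4 ? fields : null,
                fields -> new String[] {fields[0].trim().toLowerCase(),
                                        fields[1], fields[2], fields[3]});

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Parser: split the raw line into fields.
            String[] fields = value.toString().split("\t");

            // Apply each Strategy in turn; a null result means the record was filtered out.
            for (Strategy s : strategies) {
                fields = s.apply(fields);
                if (fields == null) {
                    return;
                }
            }

            // Collector: emit the user id as the join key and the remaining fields as the value.
            context.write(new Text(fields[0]),
                    new Text(String.join("\t", Arrays.copyOfRange(fields, 1, fields.length))));
        }
    }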

 

The dynamically scalable modular data pipeline distributes complex data processing across multiple modules; each module performs one independent logical function and stores its results in the database. The single responsibility of each module makes it easier to find and optimize system bottlenecks. Such an architecture also allows multiple technologies to coexist in one system: we can select the most appropriate technology for each module based on its data volume and speed requirements.

For example, a module with a small amount of data can use Spark for fast processing, while a big-data module can use Hadoop MapReduce for fine-grained data filtering. The Strategies inside each module are configurable, so swapping in a different attribution algorithm is enough to complete an algorithm iteration quickly. Because intermediate results are stored in the database, they can easily be reused. Another benefit of modularity is that we can easily test new big data technologies, which speeds up the evolution of the technology stack.

Walmart's big data infrastructure includes Hadoop, MapReduce, Spark, HDFS, S3, Cassandra, Hive, Kafka, Pig, and Logstash. Among them:

● The computing platform for parallel processing is Hadoop, and interfaces such as HiveQL, MapReduce, and Spark are provided on top of Hadoop;

● HiveQL provides SQL-like data query and access, but the efficiency is relatively low; MapReduce provides fine programming control functions;

● Spark provides set-oriented processing functions and data stream processing capabilities. Its data is mainly stored in memory, so the processing speed is much faster than that of Hadoop-based MapReduce.

Spark was still an emerging big data technology at the time, and its performance was not yet stable: when the data size does not fit the available memory, jobs are prone to failure. Given our architecture, we chose MapReduce and Spark as the main supporting technologies: modules with small data volumes are implemented with Spark, and modules with large data volumes are implemented with MapReduce.
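For instance, a smaller module such as joining attributed transactions with the A/B bucket assignment could be a short Spark job. The sketch below assumes the Spark SQL Java API with Hive support; the table and column names are hypothetical, not the actual WMX schema:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class AudienceJoinJob {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("wmx-audience-join")   // hypothetical module name
                    .enableHiveSupport()
                    .getOrCreate();

            // Hypothetical Hive tables written by upstream modules of the pipeline.
            Dataset<Row> audience = spark.table("wmx.audience_bucket");           // userId, bucket
            Dataset<Row> attributed = spark.table("wmx.attributed_transactions"); // userId, campaignId, amount

            // Join the attributed transactions with the A/B bucket assignment and
            // write the result back for the next module (or for Druid indexing).
            Dataset<Row> joined = attributed.join(audience, "userId");
            joined.write().mode("overwrite").saveAsTable("wmx.transactions_with_bucket");

            spark.stop();
        }
    }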

For big data storage, the candidates are HDFS, S3, Cassandra, and Hive.

● Cassandra provides fast key-value storage with fast access but at a relatively high cost, which makes it unsuitable for data volumes of dozens of terabytes per day;

● Hive is an SQL layer on top of HDFS and can be used from both MapReduce and Spark;

● S3 can provide big data storage on the same scale as HDFS.

We ultimately chose HDFS with Hive as the storage format.

The dynamically scalable modular data pipeline produces a single high-dimensional table without pre-grouping or pre-aggregation; that table is handed to Druid for indexing. Druid is an open source data store and online analytical processing (OLAP) tool that can quickly answer aggregation queries over any combination of dimensions, which greatly reduces the number of reports we need to generate.
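For illustration, a benefit report sliced by bucket, item type, and device can then be answered by a single Druid native groupBy query over the indexed table; the datasource, dimension, metric, and interval values below are hypothetical:

    {
      "queryType": "groupBy",
      "dataSource": "wmx_campaign_sales",
      "granularity": "all",
      "intervals": ["2016-01-01/2016-02-01"],
      "dimensions": ["bucket", "item_type", "device"],
      "aggregations": [
        { "type": "doubleSum", "name": "total_sales", "fieldName": "transaction_amount" }
      ]
    }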

 

3. Code Review-Centric Software Quality Management

In software development we aim to improve software quality, team collaboration, and development efficiency. The open source development technologies we use include Linux, Java, Scala, Python, Hadoop, Spark, JUnit, and Maven, along with the code hosting tool Git, the code review tool Gerrit, the shared documentation tool Confluence, the CI tool Jenkins, and the project management tool Jira. We follow an Agile development process managed in Jira, with two-week sprints. The main programming languages are Java and Scala.

A distinctive feature of the project's implementation is code review-centric software development. The development cycle is: code, commit, review, merge. If a review fails, the change is revised, resubmitted, and reviewed again; once the review passes, the code can be merged.

First, we organize code into the smallest logical functional units: each commit completes one independent logical function. For new features, we develop the control framework first and then each sub-function. When multiple submodules share common functionality, we develop the common module first and then apply it to implement the submodules. Bug fixes are likewise divided along logical boundaries. This keeps the implementation logic clear and reduces bugs, and it also helps reviewers understand the code and maintain it in the future.

We use Git for code sharing and Gerrit for code review. When code is submitted, the relevant developers are notified by email to review it. Reviewers score a change from -2 to +2, and a change can be merged only after it has accumulated +4 points. In general, a reviewer can raise any question about the code and score it -1; after the committer answers the question, the reviewer can change the score.

Second, we strengthened unit testing and require tests to achieve 100% code coverage, except for some hard-to-test code. Code must compile and pass its unit tests before it can be submitted. We use open source tools such as JUnit, Mockito, Hamcrest, Truth, and JaCoCo for unit testing. Unit testing eliminated a large number of bugs during the development phase.

Third, we reuse software as much as possible, including open source libraries such as Apache projects, Hadoop, Guava, and Protobuf, WalmartLabs' shared code repositories, and already-tested code. Whenever possible we also extract common code into shared modules. This reduces development effort, improves software quality, and lowers the maintenance burden.

Fourth, we use a Jenkins server for continuous integration. CI is triggered automatically whenever code is submitted, and code can be merged only if the CI build succeeds.

Code review-centric software development changes the development process, forcing developers to clarify the logic of a feature before writing code. Strict code review eliminates a large number of bugs during development and improves the readability and maintainability of the code. System bugs were reduced to 5% of their previous level, and the average bug-fix time fell to one fifth. It also increases communication within the team, improves collaboration, and ensures the code can be maintained by more than one person, which in turn helps technical exchange and the introduction of new technologies.

Code review-centric development also changes how development time is allocated: roughly 1:2:1 across programming, testing, and reviewing.

For more TOP100 case information and schedule, please visit [Official Website] .

 
