NetEase's Zheng Dong on data collection and analysis: from event tracking to A/B testing

This article is published by NetEase Cloud.

 

On the evening of April 8, DTalk invited Mr. Zheng Dong, head of NetEase's Internet analytics and visual BI products, to give a talk titled "NetEase Zheng Dong: Data Collection and Analysis, Part One: The Data Chapter". The session had two parts: Mr. Zheng Dong's talk on the key issues in data collection and analysis, followed by an interactive Q&A.

Mr. Zheng Dong is head of NetEase's Internet analytics and visual BI products. He has worked on big data technology for many years and currently manages two of NetEase's data analysis product lines: Internet analytics and agile BI. He has extensive experience in big data technology, building data systems for Internet businesses, and team management. He has been responsible for building the data systems of many NetEase businesses and products, and has hands-on experience with data products for application analytics, marketing monitoring, user behavior analysis, and visual analysis.

 

The following is Mr. Zheng Dong's talk on data collection and analysis:

 

1. Why do enterprises need a complete user behavior tracking and analysis platform?

In a product's start-up period, you need to analyze the behavior of early ("angel") users to improve the product, and sometimes user behavior yields new ideas or discoveries that change the product's direction. As the product grows, multi-dimensional analysis of user behavior lets you segment users and analyze and compare the behavioral characteristics of each segment, which guides product design and operational campaigns and helps evaluate the effectiveness of marketing channels.

 

Combined with an A/B testing platform, this speeds up product iteration and brings real user feedback sooner. The accumulated data also helps the business build its data warehouse and its data-intelligence applications. For example, real-time recommendation needs as much detailed user behavior data as possible, as quickly as possible; machine learning services such as user classification and intent prediction need cleaned, standardized, structured data for training.

 

Analyzing user behavior requires infrastructure for collecting, transmitting, processing, and analyzing user behavior data, and that is what a tracking and analysis platform provides. Most products in the industry collect user behavior data through SDKs embedded in each client, while the subsequent transmission and processing are transparent to the business. This lets a business handle data collection, cleaning, and accumulation at very low cost, which saves money for the enterprise and makes data-driven work more efficient.

 

On the analysis platform, each user behavior is identified by a specific event, such as "buttonClick" or "playMusic". Developers usually define these events by calling an API provided by the SDK. Besides the event name, they can attach the custom parameters and values needed for analysis. This work is what is called event tracking (literally "burying points", which is why it is sometimes translated as "buried points" or "embedding"). Some tools and products also support visual tracking, which needs no development work: the SDK automatically collects the user's behavior on each client.
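To make the idea of an event with a name plus custom parameters concrete, here is a minimal sketch in Python. The `Tracker` class, its `track` and `flush` methods, and the field names are hypothetical stand-ins for a tracking SDK, not the API of any real product.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from typing import Any


@dataclass
class Event:
    name: str                      # event name, e.g. "playMusic"
    user_id: str                   # user/device ID assigned by the SDK
    properties: dict = field(default_factory=dict)  # custom parameters
    timestamp_ms: int = 0


class Tracker:
    """Hypothetical stand-in for a tracking SDK; not any real vendor's API."""

    def __init__(self, user_id: str) -> None:
        self.user_id = user_id
        self.buffer: list = []     # events waiting to be uploaded

    def track(self, name: str, **properties: Any) -> Event:
        """Record one event with its custom parameters."""
        event = Event(name, self.user_id, dict(properties), int(time.time() * 1000))
        self.buffer.append(event)
        return event

    def flush(self) -> str:
        """Serialize buffered events into the payload an SDK would upload."""
        payload = json.dumps([asdict(e) for e in self.buffer])
        self.buffer.clear()
        return payload


tracker = Tracker(user_id="device-123")
tracker.track("buttonClick", button="share", page="songDetail")
tracker.track("playMusic", song_id="s-42", source="recommendation")
payload = tracker.flush()
```

Each call site in the application code is one tracking point; the event name and the parameter names are exactly what the business side and developers need to agree on.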

2. What are the differences between code-based tracking, visual tracking, and codeless tracking, and how do you choose between them?

Visual tracking means that, apart from integrating the collection SDK, developers do not need to write any additional tracking code. Instead, business staff use the analysis platform's element-selection ("circle selection") feature to select the controls whose user interactions should be captured and give each event a name. Once the selection is saved, the configuration is synchronized to every user's device, and the collection SDK automatically collects and sends user behavior data according to it.

 

Codeless tracking (also called "no tracking") means that once the developer integrates the collection SDK, the SDK immediately starts capturing all user behavior in the application and sends all of it to the analysis platform, with no additional code from the developer. At analysis time, business staff select the behaviors they care about with the platform's element-selection feature and name the events. They can then run multi-dimensional analysis on those specific behaviors (events).

 

Visual tracking and codeless tracking are similar: neither requires developers to add code by hand, and both require business staff to select the behaviors they care about. The biggest difference is in what happens on the user's device. Visual tracking collects only the behavior data that business staff have selected, while codeless tracking collects all of the user's behavior data, so the latter usually produces far more data than the former.

 

Because codeless tracking collects all user behavior data by default, it supports retroactive analysis: after business staff define (select) a new event, they can analyze that event's data from the previous month or two. This is something visual tracking and code-based tracking do not support. The drawback is that collecting everything is somewhat intrusive to the application and increases the amount of data the client uploads. These problems can be mitigated with strategies such as uploading over Wi-Fi.
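The difference in what gets stored, and why only the collect-everything approach allows retroactive analysis, can be sketched as follows. This is a toy model under stated assumptions: the record shape, the day index, and the function names are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawInteraction:
    day: int          # day index, kept as an integer for simplicity
    user_id: str
    control_id: str   # ID of the UI control the user touched


# Everything users did on their devices (made-up sample data).
interactions = [
    RawInteraction(1, "u1", "btn_share"),
    RawInteraction(1, "u2", "btn_play"),
    RawInteraction(2, "u1", "btn_play"),
    RawInteraction(3, "u2", "btn_share"),
]


def visual_tracking(raw, defined_on_day):
    """Upload only selected controls, and only from the day each was selected."""
    return [r for r in raw
            if r.control_id in defined_on_day and r.day >= defined_on_day[r.control_id]]


def collect_everything(raw):
    """Upload every interaction; events are defined later, at query time."""
    return list(raw)


def query(stored, control_id):
    """Analysis-time lookup of one event by the control it was bound to."""
    return [r for r in stored if r.control_id == control_id]


# The "share" event is defined (selected) on day 3.
defined_on_day = {"btn_share": 3}

visual_store = visual_tracking(interactions, defined_on_day)
full_store = collect_everything(interactions)

share_visual = query(visual_store, "btn_share")   # only data from day 3 onward
share_full = query(full_store, "btn_share")       # includes the day-1 share
```

The full store answers questions about days before the event was defined, at the cost of holding four records where the selective store holds one.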

 

A major flaw shared by codeless and visual tracking is that both rely on the SDK listening for events fired by the application's controls (the user's operations on those controls). When the UI changes in a version upgrade, or the product goes through a major redesign, the tracking for some behaviors is lost. For example, if a control's ID changes but the selection configuration does not, the data can no longer be collected. Or the collected data no longer matches what the business needs, for example when the selected control's function has changed but the selection configuration has not. These problems lead to errors in parts of the product analysis, and they are often troublesome to track down and hard to eliminate completely by technical means.

 

In addition, both visual and codeless tracking target client-side data collection. Some user behavior cannot be collected on the client, or cannot be collected there accurately enough. Payment is an example: in the overwhelming majority of scenarios, whether a payment succeeded is determined on the server, so tracking payment on the client produces large errors. Such events need to be tracked on the server side.

 

My advice on choosing is this: early in a product's life, when its form is not yet stable and the analysis is still fairly simple, use codeless or visual tracking so that tracking can be set up quickly; otherwise frequent product changes will have developers spending a lot of time maintaining trivial tracking code. Once the product is stable, prefer code-based tracking to keep the event model stable, which makes long-term monitoring, analysis, and data accumulation easier.

 

3. What has been done in practice to get tracking work implemented and to maintain and manage it better?

 

A data-driven workflow for a product or business often looks like this:

 

  1. Define the milestones of the product;
  2. Plan and define metrics, including product, operational, and marketing goals;
  3. Product, operations, and other business staff determine the tracking requirements;
  4. Developers implement the tracking and data reporting;
  5. Data developers handle data cleaning, wide-table construction, metric calculation, and so on;
  6. Business staff analyze the data and identify product problems or potential opportunities;
  7. Move on to the next round of product, operations, and marketing improvements.

 

The goal of a user behavior analysis platform is to simplify and automate the work in steps 4-6, freeing developers to do work that is more valuable to the business. The work in steps 1-3 does not look complicated: define metrics based on the current state of the business, list the tracking requirements exhaustively, and confirm them with developers. In practice, though, many companies and businesses do not do it well enough.

 

The number of tracked events grows rapidly, and most people on the team may not know what some of them are for. Or business staff define the tracking requirements, but developers implement them incorrectly and nobody notices for a long time, so the analysis is misread and decisions are affected.

 

There are a few things that can be done here:

 

  • A metric management system that maintains the data tables, fields, and calculation methods each metric depends on, so that development, analysis, and interpretation all use the same definitions.

 

  • A tracking management system that manages tracking metadata, including event names, the definitions of custom fields and their specific values, where in the product or in which scenario each event is triggered, and the tracking workflow. It serves as the bridge and the reference point for communication among business staff, developers, and analysts.
  • A tracking test and verification system that provides debugging tools so developers can debug tracking quickly, and that validates live tracking data against the event specifications so that nonconforming data is found as early as possible. This improves the efficiency and accuracy of tracking work.

 

In short: a metadata management system plus testing and verification tools.
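As a rough illustration of how tracking metadata and verification fit together, the sketch below keeps event specifications in a registry and checks incoming events against them. The `EventSpec` and `ParamSpec` structures, the `validate` function, and the sample "playMusic" specification are assumptions made for this example, not the design of any particular platform.

```python
from dataclasses import dataclass, field


@dataclass
class ParamSpec:
    type_: type                        # expected Python type of the value
    required: bool = True
    allowed: frozenset = frozenset()   # empty set means any value is accepted


@dataclass
class EventSpec:
    name: str
    description: str                   # what the event means to the business
    trigger: str                       # where or when the event fires
    params: dict = field(default_factory=dict)


# Hypothetical metadata registry with a single event definition.
REGISTRY = {
    "playMusic": EventSpec(
        name="playMusic",
        description="User starts playing a song",
        trigger="Tap on the play button of the song detail page",
        params={
            "song_id": ParamSpec(str),
            "source": ParamSpec(str, allowed=frozenset({"search", "recommendation", "playlist"})),
            "position": ParamSpec(int, required=False),
        },
    ),
}


def validate(event_name: str, properties: dict) -> list:
    """Return a list of problems; an empty list means the event conforms."""
    spec = REGISTRY.get(event_name)
    if spec is None:
        return [f"unknown event: {event_name}"]
    problems = []
    for key, param in spec.params.items():
        if key not in properties:
            if param.required:
                problems.append(f"missing required parameter: {key}")
            continue
        value = properties[key]
        if not isinstance(value, param.type_):
            problems.append(
                f"{key}: expected {param.type_.__name__}, got {type(value).__name__}")
        elif param.allowed and value not in param.allowed:
            problems.append(f"{key}: value {value!r} not in allowed set")
    for key in properties:
        if key not in spec.params:
            problems.append(f"undeclared parameter: {key}")
    return problems


problems = validate("playMusic", {"source": "banner", "extra": 1})
```

The same check can run in a debug tool during development and against live data afterwards, so a mistake in a tracking point surfaces before it distorts an analysis.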

 

4. How can tracking work be coordinated with R&D and carried out well?

 

In practice, many developers are reluctant to do tracking work. It is tedious, the burden tends to grow as the product develops, and the maintenance workload is not small.

 

To make tracking work go better in R&D, the most important thing is to simplify the developers' job, which means reducing both development cost and communication cost.

 

A complete tracking management system gives R&D a specification to develop against, reduces the inefficiency and rework caused by requirements passed along by word of mouth, and keeps definitions and progress consistent. An efficient, easy-to-use tracking test and verification system lets developers debug tracking quickly and improves development efficiency. It also lets the business side verify requirements early, instead of discovering problems only after the application has been released.

 

It is also best to keep sharing with developers how data drives business growth, so that everyone understands the value of this work and takes it more seriously.

 

5. How can tracking data collection work better with building the enterprise's data assets?

When building a user behavior analysis platform, the data side includes the following capabilities:

 

  • Data access: support data collection from clients, the web, and servers, such as iOS, Android, and WeChat mini programs, as well as adapters for various data sources and even third-party services.
  • Data transmission: as the user base and data volume grow, keep the transmission service highly available and the collected data timely in transit.
  • Data modeling/storage: clean, model, and store the data in real time.

 

These capabilities serve as infrastructure when an Internet business builds its data assets, especially in areas related to users, traffic, and products. Standardized data collection, together with efficient transmission and modeling, is a prerequisite for building an enterprise's business data assets effectively.

 

The modeled data can serve as wide tables in the bottom layer of the data warehouse (the ODS layer) and be integrated with the enterprise's other business data to build out its data assets.
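A small sketch of what cleaning raw events into wide-table rows can look like. The column names, the raw record shape, and the `to_wide_row` helper are illustrative assumptions; a production pipeline would do this in a streaming or batch engine rather than in a Python list.

```python
from typing import Optional

# Fixed columns of the wide table; the property columns are illustrative.
BASE_COLUMNS = ["event", "user_id", "ts", "platform"]
PROPERTY_COLUMNS = ["song_id", "page"]


def to_wide_row(raw: dict) -> Optional[dict]:
    """Clean one raw event and flatten it into a fixed-column row."""
    if not raw.get("event") or not raw.get("user_id"):
        return None                      # drop records missing mandatory fields
    props = raw.get("properties") or {}
    row = {
        "event": str(raw["event"]).strip(),
        "user_id": str(raw["user_id"]),
        "ts": int(raw.get("ts", 0)),                            # normalize to integer
        "platform": str(raw.get("platform", "unknown")).lower(),
    }
    for col in PROPERTY_COLUMNS:
        row[col] = props.get(col)        # properties that are absent become None
    return row


raw_events = [
    {"event": " playMusic ", "user_id": "u1", "ts": "1700000000", "platform": "iOS",
     "properties": {"song_id": "s-42"}},
    {"event": "pageView", "user_id": 7, "ts": 1700000005, "platform": "Android",
     "properties": {"page": "songDetail", "debug": True}},
    {"event": "", "user_id": "u3"},      # malformed: no event name
]

rows = [row for row in map(to_wide_row, raw_events) if row is not None]
```

Every surviving row has the same columns, which is what makes the result easy to join with other business tables downstream.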

 

In addition, this structured user-side data, together with real-time modeling and open access, can be combined with machine learning algorithms. Whether for personalized recommendation, precision marketing, or risk control in banking and e-commerce, it can be very powerful: it accumulates data and clears obstacles for the enterprise's intelligence-driven business.

 

Take building a DMP (user profiles) as an example:

 

When building their own DMP, enterprises often start with quasi-static labels such as conventional demographic attributes, and general labels such as spending power, obtained from their own business data or from third-party cooperation. These labels tend to be generic across businesses. A specific business often needs profile labels that are closer to it, such as mother-and-baby shoppers, electronics enthusiasts, or users with a preference for particular cosmetics brands in an e-commerce scenario. Discovering these labels and users requires in-depth analysis of user behavior. The user behavior analysis platform can do this work, for example by segmenting and comparing users according to their behavior patterns and business attributes to mine valuable user labels.
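One simple way to derive business-specific labels from behavior is to score each user's events per category and apply a threshold. The sketch below does exactly that; the weights, the threshold, the label names, and the sample events are invented for illustration, and a real system would tune or learn them.

```python
from collections import Counter, defaultdict

# (user_id, event, category) tuples; made-up sample data.
events = [
    ("u1", "viewItem", "baby"), ("u1", "purchase", "baby"), ("u1", "viewItem", "baby"),
    ("u1", "viewItem", "electronics"),
    ("u2", "viewItem", "electronics"), ("u2", "viewItem", "electronics"),
    ("u2", "purchase", "electronics"), ("u2", "viewItem", "cosmetics"),
    ("u3", "viewItem", "cosmetics"),
]

# Assumed mapping from product category to profile label.
LABELS = {"baby": "mother-and-baby shopper", "electronics": "electronics enthusiast"}
WEIGHTS = {"viewItem": 1, "purchase": 3}   # assumed: a purchase counts more than a view
MIN_SCORE = 4                              # assumed threshold for assigning a label


def derive_labels(events):
    """Score each user's behavior per category and label those above the threshold."""
    scores = defaultdict(Counter)
    for user_id, event, category in events:
        scores[user_id][category] += WEIGHTS.get(event, 0)
    labels = {}
    for user_id, per_category in scores.items():
        labels[user_id] = sorted(
            LABELS[c] for c, s in per_category.items()
            if c in LABELS and s >= MIN_SCORE
        )
    return labels


labels = derive_labels(events)
```

Here u1 scores 5 in the baby category and u2 scores 5 in electronics, so each receives one label, while u3 has too little activity to be labeled.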

 

Conversely, user profile data can be integrated into the analysis platform to improve each analysis model's insight into different user groups, make analyses and metric comparisons more targeted, and strengthen the data's ability to drive the business.

6. How can the tracking and analysis platform and the A/B testing platform reinforce each other?

 

An A/B testing product provides a professional, efficient testing platform that helps teams validate and analyze product decisions. The typical workflow is:

 

Integrate the SDK -> create test variations -> set variables and optimization metrics -> adjust test traffic -> run the test -> monitor the data in real time to evaluate the effect -> official release
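The "adjust test traffic" step is commonly implemented by hashing the user ID into a fixed number of buckets, so that a user always lands in the same variation. The sketch below shows one such scheme; the function names, the bucket count, and the experiment name are assumptions for illustration, and the source does not say which scheme any particular platform uses.

```python
import hashlib

BUCKETS = 10000  # assumed granularity: traffic is adjustable in steps of 0.01%


def bucket(user_id: str, experiment: str) -> int:
    """Map a user to a stable bucket in [0, BUCKETS) for one experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % BUCKETS


def assign(user_id: str, experiment: str, traffic_fraction: float, variations: list):
    """Return a variation name, or None if the user is outside the test traffic."""
    b = bucket(user_id, experiment)
    if b >= round(traffic_fraction * BUCKETS):
        return None                       # user sees the default experience
    return variations[b % len(variations)]


# Example: 50% of users enter the test and are split between two variations.
variation = assign("user-1", "new_checkout", 0.5, ["control", "treatment"])
```

Because the bucket depends only on the experiment name and the user ID, raising the traffic fraction adds new users to the test without moving anyone who was already assigned.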

 

The SDKs of the testing platform and the analysis platform overlap in many functions and can be merged into a single SDK implementation, which spares business applications the burden of integrating too many SDKs.

 

For data collection, modeling, and analysis, the analysis platform can serve as the back-end data store of the A/B testing platform. Optimization metrics can then be evaluated against the full set of user behavior, and business staff and developers do not have to repeat tracking definitions and development work across multiple tools. In addition, many analysis models and metrics already accumulated in the analysis platform can be selected and used directly in the A/B testing platform without being set up again there. Besides reducing work for business staff, this also keeps the statistical definitions consistent.

 

Conversely, the groups from comparative experiments on the A/B testing platform, and the user groups of specific grayscale (canary) releases, can be imported into the analysis platform. Through its group analysis capability, these groups can be applied to the various analysis models for targeted analysis. Even after an experiment ends, these users can continue to be tracked and analyzed to gain better insight into them.

7. How do you connect a product's tracking data across multiple clients?

This is an attribution problem. Generally speaking, connecting accounts always brings up the topic of attribution.

 

In current analytics products, the mobile SDK normally generates a unique ID to identify the user/device. In the early days of mobile development, many collection tools identified devices using hardware and software information available from the mobile operating system, such as the MAC address, IDFA, android_id, and IMEI. Many of these identifiers have since become unavailable or have lost precision. By contrast, tools that identified users through a self-generated ID from the start were not much affected and have largely kept their user/device identification stable.

 

This approach has a problem, though: after the user uninstalls and reinstalls the app, or reflashes the device, the ID is lost and a new user/device ID is generated.

 

We have used ID mapping to connect IDs: generate a virtual ID for each user, and map and bind that user's multiple devices and accounts to it.

 

  • Map through identifiers provided by the operating system that are not permanently stable but are relatively stable over short periods, such as IDFA on iOS.
  • Use the product's application coverage: if a user has both application A and application B, then after application B is uninstalled and reinstalled, its ID can be recovered through application A's ID.
  • Bind through the product's user account system. This method is the most stable, but it does not easily solve the problem of anonymous users who are not logged in.
  • Map through IP address, Wi-Fi information, device model, and even geographic location. This requires the user to grant more data access permissions. It is only an approximate match, but when the signals are numerous and varied enough (the information entropy is high enough), it can also serve for unified identification.
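The binding step can be sketched with a union-find (disjoint-set) structure: every identifier is a node, every observed association merges two groups, and each group's representative acts as the virtual ID. This is only one possible implementation, chosen here for brevity; the `IdMapper` class and the identifier prefixes are hypothetical.

```python
class IdMapper:
    """Union-find over identifiers; each connected group is one virtual user."""

    def __init__(self):
        self.parent = {}

    def _find(self, x):
        """Return the representative of x's group, registering x if it is new."""
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def bind(self, a, b):
        """Record that identifiers a and b belong to the same user."""
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            # Keep the lexicographically smaller root so results are deterministic.
            low, high = sorted((ra, rb))
            self.parent[high] = low

    def virtual_id(self, x):
        """Return the virtual ID currently assigned to identifier x."""
        return self._find(x)


mapper = IdMapper()
mapper.bind("device:phone-A", "account:alice")    # login on the phone
mapper.bind("device:tablet-B", "account:alice")   # login on the tablet
mapper.bind("device:phone-A", "idfa:1234")        # IDFA observed alongside the SDK ID
mapper.bind("device:phone-A2", "idfa:1234")       # reinstall: new SDK ID, same IDFA

same_user = mapper.virtual_id("device:phone-A2") == mapper.virtual_id("device:tablet-B")
```

Note that a group's representative can change when two groups merge, which is why a late binding may require events already stored under the old virtual ID to be rewritten.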

 

This virtual ID is what actually connects the product's data across clients. Building an ID mapping system is a considerable amount of work. If a user's ID has to be adjusted after mapping, old data must be rewritten in an event-based analytics product, which is complicated. So for products with a strong account system, a simpler fallback is to associate data by user account only and to identify only anonymous, not-logged-in users by device ID. This is often a cost-effective solution.

 

Promotion channel attribution is easy.

 

Analysis platforms that support marketing effectiveness evaluation require the product to generate its promotion links on the platform before placing them. When a user clicks the link, the request passes through the analysis platform's domain before redirecting to the target page, so the browser's cookie mechanism can be used to match the user and attribute their source. On mobile this does not work very well (iOS has removed support for sharing cookies across multiple apps in SFSafariViewController). The approximate matching described under ID mapping can also be used. Most of the device fingerprinting technologies that many vendors advertise work in a similar way: they are not very accurate, but they are good enough for qualitative analysis.

 

For attribution, some promotion channels have done work to address the difficulty of tracing sources on mobile: they support passing the device ID back, which makes the product's attribution problem easier to solve.

 

When placing links, the product side can follow a specific format

After the user clicks the advertisement link, the channel adds a device ID such as IDFA or IMEI to the link. Once the user activates the app, the activation can be attributed by matching the corresponding ID.
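The matching step can be sketched as follows. The talk only says that activations are matched to clicks by device ID; the last-click rule and the 7-day attribution window below are additional assumptions made for this example, as are the record types and channel names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Click:
    device_id: str    # IDFA or IMEI passed back by the channel
    channel: str
    ts: int           # Unix time in seconds


@dataclass(frozen=True)
class Activation:
    device_id: str
    ts: int


ATTRIBUTION_WINDOW = 7 * 24 * 3600  # assumed: clicks older than 7 days do not count


def attribute(activation: Activation, clicks: list) -> Optional[str]:
    """Return the channel of the latest matching click before activation, else None."""
    candidates = [
        c for c in clicks
        if c.device_id == activation.device_id
        and 0 <= activation.ts - c.ts <= ATTRIBUTION_WINDOW
    ]
    if not candidates:
        return None                   # no match: treated as an organic install
    return max(candidates, key=lambda c: c.ts).channel


clicks = [
    Click("idfa-1", "channel_a", ts=1000),
    Click("idfa-1", "channel_b", ts=5000),
    Click("idfa-2", "channel_a", ts=2000),
]

source = attribute(Activation("idfa-1", ts=6000), clicks)
```

In this sample the device clicked ads on two channels before activating, and the later click, from channel_b, receives the credit.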

 

 

The following are excerpts from Mr. Zheng Dong's answers in the Q&A session:

 

Q1. "Solution-Product-AI direction: Is this method currently being used in related products? And apart from being complicated, does ID mapping have other drawbacks?"

- - - - - - - - - - - - - - -

ID mapping is quite complicated, but it is basically a must if you want to build user profiles with fairly broad coverage.

 

Q2. "Wang Yongtao@d s: @Zheng Dong, Mr. Zheng, does this method apply to both iOS and Android?"

- - - - - - - - - - - - - - -

Yes, it applies to both.

 

Q3. "Shuihanzhu-Government-Enterprise-Data Analysis: Mr. Zheng, is visual tracking based on the original demo?"

- - - - - - - - - - - - - - -

Tracking at the UI level is fine, as long as the SDK supports it.

 

Q4. "Ben: Mr. Zheng, when reading about event tracking I saw that too many tracking points can slow the product's response and hurt the user experience. Under what circumstances does this happen?"

- - - - - - - - - - - - - - -

Tracking generally does not cause this. Most events are sent asynchronously, and the amount of data is relatively small. For comparison, listening to one song or opening one details page costs about as much as sending thousands of tracking records. If the product is slow, the SDK may be poorly implemented, for example by hooking too many system calls, or the application itself may have a problem.
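A minimal sketch of what "sent asynchronously" usually means in an SDK: the calling thread only puts the event on a queue, and a background worker batches and uploads. The `AsyncSender` class and the `send_batch` callback are hypothetical; here the callback just appends to a list instead of making a network request.

```python
import queue
import threading


class AsyncSender:
    """Hypothetical async uploader: track() never blocks on network I/O."""

    def __init__(self, send_batch, batch_size=50):
        self._queue = queue.Queue()
        self._send_batch = send_batch      # callback that uploads one batch
        self._batch_size = batch_size
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def track(self, event):
        """Called from the UI thread: only enqueues the event."""
        self._queue.put(event)

    def close(self):
        """Flush whatever is pending and stop the worker."""
        self._queue.put(None)              # sentinel value
        self._worker.join()

    def _run(self):
        batch = []
        while True:
            item = self._queue.get()
            if item is None:
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self._send_batch(batch)
                batch = []
        if batch:
            self._send_batch(batch)        # send the final partial batch


sent_batches = []
sender = AsyncSender(send_batch=sent_batches.append, batch_size=3)
for i in range(7):
    sender.track({"event": "buttonClick", "seq": i})
sender.close()
```

Seven events with a batch size of three produce three uploads, and the code that calls `track` never waits for any of them.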

 

Q5. "Shuihanzhu-Government and Enterprise-Data Analysis: Tracking involves trigger timing and the corresponding parameters. How are these specified?"

- - - - - - - - - - - - - - -

Tracking parameters must be managed through metadata. In practice, many product managers in our businesses write them into the interaction design drafts, so that developers do not interpret them differently.

 

This article is reproduced from the public account: DTalks

Original link: [DTalk Essence] Netease Zheng Dong: Those things about data collection and analysis: from data burying to AB testing

 


 
