NetEase Games' Flink-Based Streaming ETL Construction

Lin Xiaobo, a senior development engineer at NetEase Games, introduces NetEase Games' Flink-based streaming ETL construction. The content includes:
  1. Business Background
  2. Dedicated ETL
  3. EntryX Generic ETL
  4. Tuning Practices
  5. Future Plans

1. Business Background

NetEase Game ETL Service Overview

The basic data of NetEase Games is mainly collected through logs. These logs are usually unstructured or semi-structured data and can be stored in the real-time or offline data warehouses only after ETL-based data integration. After that, business users can easily complete most data computation with SQL, including real-time Flink SQL and offline Hive or Spark.

The data flow of NetEase Games' data integration is similar to that of most companies and mainly includes game client logs, game server logs, and other peripheral logs, such as Nginx access logs and database logs. These logs are collected into a unified Kafka data pipeline and then written to the Hive offline data warehouse or the Kafka real-time data warehouse through the ETL loading services.

This is a very common architecture, but there are some special cases in our requirements.

Characteristics of NetEase Games' streaming ETL requirements

First of all, unlike the Internet, finance, and other industries that commonly use relational databases such as MySQL and Postgres, the game industry often uses schema-free document databases such as MongoDB. The problem this brings to our ETL service is that there is no accurate schema for the online business to rely on: in actual data processing, a record may have extra or missing fields, and a field may even change to a completely different format as the gameplay iterates. Such data heterogeneity makes our ETL data cleaning relatively costly.

Secondly, because of this database selection, the database schemas of most businesses follow a denormalized design and deliberately use complex nested fields to avoid joins between tables. One advantage of this is that we do not need to join multiple data streams in real time during the data integration phase; the disadvantage is that the data structures can be very complex, and multi-level nesting is very common.

Then, with the popularity of real-time data warehouses in recent years, we have also been gradually building our own. It is therefore a natural direction to reuse the existing ETL pipeline: extract and transform once, and load into both the real-time and offline data warehouses.

Finally, we have many log types and they change frequently. For example, a game with complex gameplay may have more than 1,000 log types and may ship a release every two weeks. In such a context, abnormal data in ETL is inevitable, so we need to provide comprehensive exception handling, so that the business can learn about data exceptions in a timely manner and repair the data through an established process.

Log classification and characteristics

In order to better optimize for different business usage patterns, we provide different services for businesses with different log types. Our logs are generally divided into three types: operational logs, business logs, and program logs.

The operation log records player behavior events, such as logging in to an account or receiving a gift package. This type of log is the most important and has a fixed format, that is, a text format of a specific header plus JSON. The data is mainly used for data reports, data analysis, and in-game recommendations, such as player team matching recommendations.

Business logs record business events other than player behavior and are broader in scope, such as Nginx access logs and CDN download logs. They have no fixed format at all and may be binary or text. Their main purposes are similar to those of the operation log, but richer and more customized.

Program logs record the operation of the program itself, that is, the INFO and ERROR logs that we usually print through a logging framework. Their main purpose is to search for and locate runtime problems. They are usually written to ES, but are sometimes also written to the data warehouse when the volume is too large or when metrics need to be extracted for analysis.

Analysis of NetEase Game ETL Service

For these log categories, we provide three kinds of ETL loading services. The first is the ETL dedicated to the operation log, customized to the operation log's format. Then there is the generic EntryX ETL service for text logs, which serves all logs except operation logs. Finally, there are special ETL requirements that EntryX cannot support, such as data that is encrypted or needs special conversion; for these, we develop ad-hoc jobs.

2. Dedicated ETL for Operational Logs

Operation Log ETL Development History

The operation log ETL service has a long history. Around 2013, NetEase Games built the first version of the offline ETL framework based on Hadoop Streaming plus Python preprocessing/postprocessing. This framework ran smoothly for many years.

In 2017, with the emergence of Spark Streaming, we developed a second version based on Spark Streaming. It was essentially a POC, but it was never applied online because of the difficulty of micro-batch tuning and the problem of too many small files.

In 2018, when Flink had become relatively mature, we decided to migrate our business to Flink, so we naturally developed the third version of the operation log ETL service based on Flink DataStream. The special point here is that our business side had accumulated a large number of Python ETL scripts over the years, and the most important requirement for the new version was to support the seamless migration of these Python UDFs.

Operational Log ETL Architecture

Next, let's take a look at the architecture comparison of the two versions.

In the earlier Hadoop Streaming version, the data was first dumped to HDFS, and Hadoop Streaming then started Mappers to read the data and pass it to the Python scripts via standard input. The Python script is divided into three modules. First is a preprocessing UDF, which usually performs string-based replacements to normalize the data; for example, the time format used by some overseas partners may differ from ours and can be unified here. The preprocessed data then enters the general parsing/transformation module, which parses the data according to the operation log format and applies general transformations, such as filtering out test-server data. After the general module there is finally a postprocessing module that performs field-specific conversions, such as the common exchange rate conversion. The data is then returned to the Mapper through standard output, and the Mapper writes it to the Hive directory in batches.

After we rebuilt the pipeline with Flink, the data source changed from HDFS to a direct connection to Kafka, the IO module was replaced by Flink's Source/Sink operators instead of the original Mapper, and the intermediate general module could be rewritten directly in Java; the remaining preprocessing and postprocessing are where we need to support Python UDFs.
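To make the rebuilt pipeline concrete, below is a minimal sketch wired with the Flink DataStream API. The Kafka address, topic name, and the stage classes are illustrative assumptions, not NetEase's actual code; the Python UDF stages are shown as pass-through stubs and are expanded in the Jython sketch later in this section.

```java
// Minimal sketch of the Flink rebuild of the operation log ETL.
// Topic names, Kafka properties and the stub stages are illustrative assumptions.
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;

public class OperationLogEtlJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");   // placeholder address
        props.setProperty("group.id", "operation-log-etl");     // placeholder group

        env.addSource(new FlinkKafkaConsumer<>("operation-log-raw", new SimpleStringSchema(), props))
           .process(new PythonUdfStage("preprocess"))   // string-level normalization via Python UDF
           .process(new GenericTransformStage())        // header + JSON parsing, common filters (Java)
           .process(new PythonUdfStage("postprocess"))  // field-level conversions via Python UDF
           .print();                                    // stand-in for the real sink (Hive directories)

        env.execute("operation-log-etl");
    }

    /** Stand-in for the Runner-backed UDF stages; see the Jython sketch further below. */
    static class PythonUdfStage extends ProcessFunction<String, String> {
        private final String phase;
        PythonUdfStage(String phase) { this.phase = phase; }
        @Override
        public void processElement(String value, Context ctx, Collector<String> out) {
            out.collect(value);  // the real stage delegates to the Python UDF Runner
        }
    }

    /** Stand-in for the general parsing/transformation module rewritten in Java. */
    static class GenericTransformStage extends ProcessFunction<String, String> {
        @Override
        public void processElement(String value, Context ctx, Collector<String> out) {
            // Kept as String end to end for brevity; the real module parses "header + JSON"
            // and filters out test-server data before handing a parsed record downstream.
            out.collect(value);
        }
    }
}
```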

Python UDF implementation

In terms of implementation, we added a Runner layer on top of Flink's ProcessFunction, and this Runner layer is responsible for cross-language execution. For the technology selection we chose Jython instead of Py4j, mainly because Jython completes the computation directly inside the JVM without starting extra Python processes, so the development, operation, and maintenance costs are relatively low. The limitations brought by Jython, such as not supporting C-based libraries like pandas, are acceptable for our Python UDFs.

The overall call chain is that the ProcessFunction lazily initializes the Runner in its open() method on the TaskManager, because Jython is not serializable. When the Runner is initialized, it is responsible for resource preparation, including adding the dependent modules to the PYTHONPATH, and then it reflectively calls the UDF function according to the configuration.

When called, the preprocessing Runner converts the input string into Jython's PyUnicode type, while for the postprocessing UDF the parsed Map object is converted into Jython's PyDictionary; these serve as the inputs of the two UDFs. A UDF can call other modules for computation and finally returns a PyObject, which the Runner converts back into a Java String or Map and hands to the ProcessFunction for output.
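As a rough illustration of this call chain, the sketch below embeds Jython inside a postprocessing ProcessFunction: the interpreter is created lazily in open(), the Python path is extended, and the UDF is fetched and called per record. The module path, function name, and working-directory layout are assumptions; the text describes the real Runner layer only at a high level.

```java
// Hedged sketch of a Jython-backed postprocessing UDF stage.
// Module path, function name and the './python_udf' layout are assumptions.
import java.util.Map;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.python.core.Py;
import org.python.core.PyObject;
import org.python.util.PythonInterpreter;

public class PostprocessUdfFunction extends ProcessFunction<Map<String, Object>, Map<String, Object>> {

    private final String udfModule;    // e.g. "my_game.postprocess" (assumed config field)
    private final String udfFunction;  // e.g. "process"              (assumed config field)

    private transient PythonInterpreter interpreter;  // Jython is not serializable, so initialize lazily
    private transient PyObject udf;

    public PostprocessUdfFunction(String udfModule, String udfFunction) {
        this.udfModule = udfModule;
        this.udfFunction = udfFunction;
    }

    @Override
    public void open(Configuration parameters) {
        interpreter = new PythonInterpreter();
        // Make the Python resources shipped by YARN visible to Jython (assumed directory name).
        interpreter.exec("import sys; sys.path.append('./python_udf')");
        interpreter.exec("from " + udfModule + " import " + udfFunction);
        udf = interpreter.get(udfFunction);
    }

    @Override
    public void processElement(Map<String, Object> value, Context ctx,
                               Collector<Map<String, Object>> out) {
        // Wrap the Java Map as a PyObject for the call; the production Runner converts it
        // into a PyDictionary as described above.
        PyObject pyResult = udf.__call__(Py.java2py(value));
        // Convert the returned PyObject back into a Java Map for downstream operators.
        @SuppressWarnings("unchecked")
        Map<String, Object> result = (Map<String, Object>) pyResult.__tojava__(Map.class);
        out.collect(result);
    }

    @Override
    public void close() {
        if (interpreter != null) {
            interpreter.cleanup();
        }
    }
}
```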

Operational Log ETL Runtime

That was only a partial view of the UDF module; let's now look at the overall ETL job. First, we provide a common Flink jar. When an ETL job is generated and submitted to the job platform, the scheduler executes the common main function to build the Flink JobGraph and pulls the ETL configuration from our configuration center, ConfigServer. The ETL configuration lists the Python modules used; the back-end service scans the other modules they reference and uploads them all to HDFS as resource files through YARN's distribution mechanism. When the Flink JobManager and TaskManagers start, these Python resources are automatically synchronized by YARN to the working directory for later use. This is the entire job initialization process.

Then, because small changes to ETL rules are very frequent, such as adding a new field or changing the filter conditions, restarting the job for every change would mean downtime that gives our downstream users a poor experience. Therefore, we classify the changes and support hot updates for lightweight changes that do not affect the Flink JobGraph. The way this is achieved is that each TaskManager starts a hot-update thread that periodically polls the configuration center to synchronize the configuration.
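A minimal sketch of such a hot-update mechanism is given below; ConfigCenterClient and EtlConfig are hypothetical stand-ins for the ConfigServer client and the ETL rule model, and the 60-second interval is an arbitrary choice.

```java
// Hedged sketch of per-TaskManager hot updates for lightweight ETL rule changes.
// ConfigCenterClient, EtlConfig and the polling interval are assumptions.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public abstract class HotReloadableEtlFunction extends ProcessFunction<String, String> {

    private transient ScheduledExecutorService reloader;
    private transient AtomicReference<EtlConfig> currentConfig;

    @Override
    public void open(Configuration parameters) {
        currentConfig = new AtomicReference<>(ConfigCenterClient.fetch());  // initial rules
        reloader = Executors.newSingleThreadScheduledExecutor();
        // Poll the configuration center periodically; only lightweight changes that keep
        // the JobGraph unchanged (new fields, filter tweaks) are applied this way.
        reloader.scheduleAtFixedRate(
                () -> currentConfig.set(ConfigCenterClient.fetch()),
                60, 60, TimeUnit.SECONDS);
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) {
        out.collect(apply(currentConfig.get(), value));  // always read the freshest rules
    }

    /** Subclasses apply the concrete parsing / filtering rules. */
    protected abstract String apply(EtlConfig config, String value);

    @Override
    public void close() {
        if (reloader != null) {
            reloader.shutdownNow();
        }
    }

    // --- assumed placeholders ---
    public static class EtlConfig { /* parsed ETL rules */ }
    static class ConfigCenterClient {
        static EtlConfig fetch() { return new EtlConfig(); }  // stub for the real ConfigServer call
    }
}
```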

3. EntryX Generic ETL

Next, we introduce our generic ETL service, EntryX. The generality here has two meanings. The first is generality of data format: it supports all kinds of text data, from unstructured to structured. The second is generality of the user group: the target users cover not only traditional users such as data analysts and data developers, but also business programmers and planners with weaker data backgrounds.

Basic Concepts of EntryX

First, let's introduce the three basic concepts of EntryX: Source, StreamingTable, and Sink. Users configure these three modules, and the system automatically generates the ETL job from them.

Source is the input of the ETL job, usually the raw log topic collected from the business side, or a topic obtained after distribution and filtering. These topics may contain only one type of log, but more often contain multiple heterogeneous logs.

Next is StreamingTable, more commonly called the flow table. The flow table defines the main metadata of the ETL pipeline, including how to transform the data and how to schema-ize the transformed data according to the schema the flow table defines. The flow table schema is the most critical concept; it is equivalent to a table DDL and mainly includes field names, field data types, field constraints, and table properties. To make it easier to connect upstream and downstream, the flow table schema uses a self-developed SQL-like type system, which supports some of our extended data types, such as a JSON type.

Finally, the Sink is responsible for mapping the flow table to the physical table in the target storage, for example a target Hive table. What is mainly needed here is the schema mapping relationship, such as which field of the flow table maps to which field of the target table, and which flow table field is used as the partition field of the target Hive table. At the bottom layer, the system automatically extracts fields according to the schema mapping, converts the data into the storage format of the target table, and loads it into the target table.

EntryX ETL pipeline

Let's take a look at the concrete implementation of the EntryX ETL pipeline. The blue parts are external storage systems, while the green parts are EntryX's internal modules.

The data first flows in from the raw data topics collected upstream and is ingested through the Source into the Filter. The Filter filters data by keyword, and generally speaking we require the filtered data to share the same schema. After these two steps of extraction, the data reaches the Transform stage.

The first step of Transform is to parse the data, which is done by the Parser. The Parser supports JSON, regex, and CSV parsing, which basically covers all cases. The second step is to transform the data, which is the responsibility of the Extender. The Extender computes derived fields through built-in functions or UDFs, most commonly flattening JSON objects and extracting nested fields. Finally there is the Formatter, which converts field values into their physical types according to the logical types previously defined by the user. For example, a field whose logical type is BIGINT is uniformly converted here into the Java long physical type.

After the Transform, the data reaches the final Load stage. The first step of Load is to decide which tables the data should be loaded into: the Splitter module splits the data according to each table's storage condition (an expression), and then in the second step the Loader writes the data to the specific external storage system. Currently we support Hive and Kafka as targets; Hive supports the Text/Parquet/JSON formats, and Kafka supports the JSON and Avro formats.
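The sketch below expresses these stages as plain Java interfaces plus a per-record driver, just to show how a record flows through Filter, Parser, Extender, Formatter, Splitter, and Loader. The signatures are assumptions for illustration, not the actual EntryX API.

```java
// Hedged sketch of the EntryX stages; names mirror the text, signatures are assumptions.
import java.util.List;
import java.util.Map;

public interface EntryXStages {

    /** Filter: keep only records matching the flow table's keyword condition. */
    interface Filter { boolean accept(String rawRecord); }

    /** Parser: JSON / regex / CSV parsing into a field map. */
    interface Parser { Map<String, String> parse(String rawRecord); }

    /** Extender: compute derived fields, e.g. flatten nested JSON objects. */
    interface Extender { Map<String, String> extend(Map<String, String> fields); }

    /** Formatter: convert values from logical types (e.g. BIGINT) to physical types (e.g. long). */
    interface Formatter { Map<String, Object> format(Map<String, String> fields); }

    /** Splitter: evaluate each sink's storage condition to pick the target tables. */
    interface Splitter { List<String> route(Map<String, Object> record); }

    /** Loader: write to the physical table (Hive Text/Parquet/JSON, Kafka JSON/Avro). */
    interface Loader { void load(String targetTable, Map<String, Object> record); }

    /** End-to-end chain for a single record: Filter -> Transform -> Load. */
    static void process(String rawRecord, Filter filter, Parser parser, Extender extender,
                        Formatter formatter, Splitter splitter, Loader loader) {
        if (!filter.accept(rawRecord)) {
            return;  // dropped by the keyword filter
        }
        Map<String, Object> record = formatter.format(extender.extend(parser.parse(rawRecord)));
        for (String table : splitter.route(record)) {
            loader.load(table, record);
        }
    }
}
```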

Real-time offline unified schema

In the design of EntryX, data can be written to both the real-time and offline data warehouses; that is, the same data is represented in different formats in different storage systems. From the point of view of Flink SQL, they are two tables with the same schema but different connectors and formats. The schema part often changes with the business, while the connector and format (that is, the storage system and storage format) are relatively stable. So a natural idea is: can the schema part be extracted and maintained independently? In fact, this abstract schema already exists; it is the flow table schema we extracted in the ETL.

In EntryX, the flow table schema is a schema independent of serializers and storage systems, serving as the Single Source of Truth. Based on the flow table schema, plus storage system information and storage format information, we can derive the DDL of the concrete physical table. At present we mainly support Hive and Kafka, and it will be easy to extend support to ES or HBase tables in the future.
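As a toy illustration of this single-schema idea, the sketch below renders one flow table schema into two DDL strings with different connector options. The field names, connector options, and output format are assumptions for demonstration, not the DDL EntryX actually generates.

```java
// Hedged sketch: derive physical-table DDLs from one flow table schema plus connector info.
// Field names, connector options and the generated DDL text are illustrative assumptions.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class DdlDeriver {

    /** flowTableSchema: field name -> SQL-like logical type (e.g. "BIGINT", "STRING"). */
    public static String deriveDdl(String tableName,
                                   LinkedHashMap<String, String> flowTableSchema,
                                   Map<String, String> connectorOptions) {
        String columns = flowTableSchema.entrySet().stream()
                .map(e -> "  `" + e.getKey() + "` " + e.getValue())
                .collect(Collectors.joining(",\n"));
        String options = connectorOptions.entrySet().stream()
                .map(e -> "  '" + e.getKey() + "' = '" + e.getValue() + "'")
                .collect(Collectors.joining(",\n"));
        return "CREATE TABLE " + tableName + " (\n" + columns + "\n) WITH (\n" + options + "\n)";
    }

    public static void main(String[] args) {
        LinkedHashMap<String, String> schema = new LinkedHashMap<>();
        schema.put("role_id", "BIGINT");
        schema.put("event_time", "TIMESTAMP(3)");
        schema.put("detail", "STRING");

        // The same logical schema rendered against two different connectors/formats.
        System.out.println(deriveDdl("player_login_kafka", schema,
                Map.of("connector", "kafka", "topic", "player_login", "format", "avro")));
        System.out.println(deriveDdl("player_login_hive", schema,
                Map.of("connector", "hive", "format", "parquet")));
    }
}
```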

Real-time data warehouse integration

An important positioning of EntryX is as the unified entry point of the real-time data warehouse. Kafka tables have been mentioned several times above, but we have not yet explained how the real-time data warehouse is built. A common problem with real-time data warehouses is that Kafka does not natively support persistence of schema metadata. The current mainstream solution in the community is to store the metadata of Kafka tables in Hive MetaStore and reuse HiveCatalog to connect directly to Flink SQL.

However, using Hive MetaStore creates several problems for us. First, introducing Hive dependencies into real-time jobs couples them with Hive, which is a heavy dependency and makes it difficult for the defined tables to be reused by other components, including Flink DataStream users. Second, we already have a Kafka SaaS platform, Avatar, that manages physical schemas such as Avro schemas; introducing Hive MetaStore would fragment the metadata. Therefore, we extended the Avatar platform's schema registry to support both logical and physical schemas.

The integration between the real-time data warehouse and EntryX is then as follows: starting from EntryX's flow table schema, when a new sink is created we call Avatar's schema interface to generate the logical schema according to the mapping relationship, and Avatar then generates the physical schema of the topic according to the mapping between Flink SQL types and physical types.

Complementing the Avatar schema registry is our self-developed KafkaCatalog, which is responsible for reading a topic's logical and physical schemas to generate Flink SQL's TableSource or TableSink. Users outside Flink SQL, such as users of the Flink DataStream API, can also read the physical schema directly and enjoy the convenience of the data warehouse.

EntryX runtime

Similar to the operation log ETL, when EntryX runs, the system generates Flink jobs from a common jar plus configuration, but there are two cases that need special handling.

The first is that a Kafka topic often carries dozens or even thousands of log types, so there are dozens or even thousands of corresponding flow tables. If each flow table ran as a separate job, one topic might be read thousands of times, which is a huge waste. Therefore, we provide an optimization at run time: different flow tables of the same source can be merged into one job. For example, in the figure, a mobile game uploads three kinds of logs to Kafka, and the user configures three flow tables for player registration, player login, and gift package collection; we can then combine these three flow tables into one job that shares the same Kafka source.
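A simplified sketch of this source-sharing optimization might look like the following; the topic name, filter keywords, and stub sinks are made up for illustration.

```java
// Hedged sketch of merging several flow tables over the same topic into one job.
// Topic name, keywords and the print() stub sinks are illustrative assumptions.
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class MergedFlowTableJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");   // placeholder
        props.setProperty("group.id", "mobile-game-etl");       // placeholder

        // One shared source: the topic is read only once for all three flow tables.
        DataStream<String> raw = env.addSource(
                new FlinkKafkaConsumer<>("mobile-game-logs", new SimpleStringSchema(), props));

        // Each flow table becomes a filtered branch with its own transform and sink.
        raw.filter(line -> line.contains("player_register")).print();  // flow table 1 (stub sink)
        raw.filter(line -> line.contains("player_login")).print();     // flow table 2 (stub sink)
        raw.filter(line -> line.contains("gift_received")).print();    // flow table 3 (stub sink)

        env.execute("merged-flow-table-etl");
    }
}
```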

The other optimization is this: in general, following the earlier idea of extracting and transforming once and loading into both warehouses, we can write data to Hive and Kafka at the same time. But Hive, or HDFS, is after all an offline system with relatively poor real-time performance, and writing to some old, heavily loaded HDFS clusters often causes back pressure that blocks the upstream and in turn affects the Kafka writes. In such cases, we usually split loading into separate real-time and offline ETL pipelines, depending on the business SLA and HDFS performance.

4. Tuning Practices

Next, we will share with you our practical experience of tuning in ETL construction.

HDFS write tuning

The first is HDFS write tuning. A common problem when streaming to HDFS is too many small files. Generally speaking, small files and low latency cannot both be achieved: if latency is to be low, we need to roll files frequently to commit data, which inevitably leads to too many small files.

Too many small files mainly cause two problems. First, from the perspective of HDFS cluster management, small files occupy a large number of files and blocks and waste NameNode memory. Second, read and write efficiency both suffer: when writing, RPC calls and data flushes happen more frequently, causing more blocking and sometimes even checkpoint timeouts; when reading, more files need to be opened to read the same amount of data.

HDFS Write Tuning - Stream Pre-Partitioning

One of the optimizations we apply to the small file problem is to pre-partition the data stream. Specifically, we perform a keyBy partitioning inside the Flink job based on the target Hive table, so that data belonging to the same table is concentrated on as few subtasks as possible.

As an example, suppose the parallelism of the Flink job is n and the number of target Hive partitions is m. Because each subtask may read data of any partition, in the default fully parallel case each subtask writes to all partitions, resulting in a total of n * m files. Assuming n is 100 and m is 1000, rolling a file every 10 minutes produces 14,400,000 files per day, which is very stressful for many old clusters.

If we optimize the partitioning of the data stream, we can limit the growth caused by Flink's parallelism. For example, if we keyBy the Hive table fields and add a salt, an integer in the range 0 to s, to avoid data skew, then each partition is read and written by at most s subtasks. Suppose s is 5; compared with the original n of 100, we reduce the number of files to 1/20 of the original.
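A minimal sketch of this pre-partitioning is shown below, assuming a placeholder Record type with a helper that yields its target Hive partition; the salt here is derived deterministically from the record's hash, since the exact salting strategy is not spelled out in the text.

```java
// Hedged sketch of stream pre-partitioning: key by target Hive partition plus a small salt
// so each partition is written by at most SALT_RANGE subtasks. Record is a placeholder type.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;

public class PrePartitioning {

    private static final int SALT_RANGE = 5;  // "s" in the text: max subtasks per Hive partition

    public static KeyedStream<Record, String> prePartition(DataStream<Record> stream) {
        // A deterministic salt (hash of the record) spreads a hot partition over SALT_RANGE keys
        // to avoid skew, while still bounding the number of writers per partition.
        return stream.keyBy(record ->
                record.hivePartition() + "#" + Math.floorMod(record.hashCode(), SALT_RANGE));
    }

    /** Placeholder for the parsed record; hivePartition() is an assumed helper, e.g. "table/dt=2021-01-01". */
    public interface Record {
        String hivePartition();
    }
}
```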

SLA statistics based on OperatorState

The second thing I want to share is our SLA statistics tool. The background is that our users often rely on the Web UI to debug and troubleshoot, for example checking the input and output counts of different subtasks, but these metrics are reset by job restarts or failovers. We therefore developed an OperatorState-based tool, SLA-Utils, to count data input and categorized output. The tool is designed to be very lightweight and can easily be integrated into our own services or into user jobs.

SLA-Utils supports three kinds of metrics. The first are the standard metrics, including recordsIn/recordsOut/recordsDropped/recordsErrored, which correspond respectively to the number of input records, normally output records, filtered records, and records with handled exceptions. Usually recordsIn equals the sum of the latter three. The second kind are user-defined metrics, which are usually used to record more detailed reasons, such as recordsEventTimeDropped, meaning data filtered because of its event time.

The two kinds of metrics above are static, that is, the metric keys must be determined before the job runs. In addition, SLA-Utils supports TTL metrics that are registered dynamically at run time. Such metrics are usually prefixed with a dynamically generated date and are automatically cleaned up once the TTL expires; they are mainly used for statistics over day-level time windows. The special point here is that because OperatorState does not support TTL, SLA-Utils filters out expired metrics every time a checkpoint snapshot is taken, which achieves the effect of a TTL.

Once the SLA metrics are stored in state, the next question is how to expose them to users. Our current approach is to expose them through Accumulators. The advantage is that the Web UI supports them out of the box and Flink automatically merges the metrics of different subtasks. The disadvantages are that there is no way to push them to the monitoring system via a metric reporter, and that, because Accumulators cannot be unregistered dynamically at run time, using TTL metrics carries a risk of memory leaks. Therefore, in the future we are also considering support for metric groups to avoid these problems.
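The sketch below shows one possible shape of such a tool: operator state carries the counters across restarts and failovers, expired TTL metrics are filtered out at snapshot time, and an accumulator re-seeded from the restored state exposes the value on the Web UI. The metric keys, the date-prefix convention, and the expiry check are assumptions.

```java
// Hedged sketch of an SLA-Utils-style counter. Metric keys, the "yyyy-MM-dd." prefix
// convention and the expiry check are assumptions; only recordsIn is shown for brevity.
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.accumulators.LongCounter;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

public class SlaCountingMap extends RichMapFunction<String, String> implements CheckpointedFunction {

    private final Map<String, Long> counters = new HashMap<>();
    private transient ListState<Map<String, Long>> counterState;
    private transient LongCounter recordsIn;  // standard metric exposed through an accumulator

    @Override
    public void open(Configuration parameters) {
        recordsIn = getRuntimeContext().getLongCounter("recordsIn");
        // Re-seed the accumulator from the restored operator state so the Web UI value
        // keeps counting across restarts and failovers.
        recordsIn.add(counters.getOrDefault("recordsIn", 0L));
    }

    @Override
    public String map(String value) {
        recordsIn.add(1L);
        counters.merge("recordsIn", 1L, Long::sum);
        // Dynamically registered TTL metric, prefixed with the current date (assumed convention).
        counters.merge(LocalDate.now() + ".recordsIn", 1L, Long::sum);
        return value;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // OperatorState has no built-in TTL, so expired day-level metrics are removed here,
        // each time a checkpoint snapshot is taken.
        counters.keySet().removeIf(SlaCountingMap::isExpiredTtlMetric);
        counterState.clear();
        counterState.add(new HashMap<>(counters));
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        counterState = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("sla-counters", Types.MAP(Types.STRING, Types.LONG)));
        for (Map<String, Long> restored : counterState.get()) {
            restored.forEach((k, v) -> counters.merge(k, v, Long::sum));
        }
    }

    private static boolean isExpiredTtlMetric(String key) {
        // Assumed convention: keys like "2021-01-01.recordsIn" expire after the configured TTL.
        return false;  // expiry check stubbed for brevity
    }
}
```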

Data fault tolerance and recovery

Finally, we will share our practice in data fault tolerance and recovery.

Similar to many best practices, we use SideOutput to collect error data from the various ETL stages and aggregate them into a unified error stream. An error record contains our preset error code, the original input data, the error type, and the error message. Under normal circumstances, the error data is classified and written to HDFS, and users can tell whether their data is normal by monitoring the HDFS directory.
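A minimal sketch of this pattern is shown below; the error-record fields follow the description above, but the class names and the stub parser are assumptions.

```java
// Hedged sketch of error collection with side outputs; ErrorRecord fields follow the text,
// while class names and the stub parser are illustrative assumptions.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class ErrorSideOutput {

    /** Shared tag for error records produced by any ETL stage. */
    public static final OutputTag<ErrorRecord> ERROR_TAG = new OutputTag<ErrorRecord>("etl-errors") {};

    /** Example stage: parsing failures go to the side output instead of failing the job. */
    public static class ParseStage extends ProcessFunction<String, ParsedRecord> {
        @Override
        public void processElement(String raw, Context ctx, Collector<ParsedRecord> out) {
            try {
                out.collect(ParsedRecord.parse(raw));
            } catch (Exception e) {
                ctx.output(ERROR_TAG, new ErrorRecord("PARSE_ERROR", raw,
                        e.getClass().getSimpleName(), e.getMessage()));
            }
        }
    }

    /** Union the error streams of all stages into one stream, e.g. to feed a classified HDFS sink. */
    public static DataStream<ErrorRecord> unifiedErrors(SingleOutputStreamOperator<?>... stages) {
        DataStream<ErrorRecord> errors = stages[0].getSideOutput(ERROR_TAG);
        for (int i = 1; i < stages.length; i++) {
            errors = errors.union(stages[i].getSideOutput(ERROR_TAG));
        }
        return errors;
    }

    // --- assumed record types ---
    public static class ErrorRecord {
        public final String errorCode, rawInput, errorType, errorMessage;
        public ErrorRecord(String code, String raw, String type, String message) {
            this.errorCode = code; this.rawInput = raw; this.errorType = type; this.errorMessage = message;
        }
    }

    public static class ParsedRecord {
        static ParsedRecord parse(String raw) { return new ParsedRecord(); }  // stub parser
    }
}
```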

After storing the abnormal data, the next step is to restore the data. There are usually two cases for this.

First, the data format is abnormal, for example the log is truncated and incomplete, or the timestamp does not conform to the agreed format. In this case, we usually repair the data through offline batch jobs and refill it into the original data pipeline.

Second, the ETL pipeline is abnormal, for example the actual schema of the data has changed but the flow table configuration has not been updated, which may leave a field empty. In this case, our solution is: first update the online flow table configuration to the latest version so that no more abnormal data is produced; at this point there are still some abnormal partitions in Hive. Then we launch an independent backfill job to repair the abnormal data; its output is written to a temporary directory, and the partition location in the Hive MetaStore is switched to point at it, replacing the original abnormal directory, so the backfill process is transparent to offline query users. Finally, at an appropriate time, we replace the data of the abnormal partition and restore its location.

5. Future Plans

Finally, we introduce our future plans.

  • The first is data lake support. At present, most of our logs are append-only, but as CDC and Flink SQL business grows we may have more update and delete requirements, so a data lake is a good choice.
  • The second is to provide richer additional features, such as real-time data deduplication and automatic merging of small files. Both are very useful to the business side.
  • The last is PyFlink support. At present, our Python support only covers the data integration stage, and we hope to realize Python support for the subsequent data warehouse stages through PyFlink.


This article is original content of Alibaba Cloud and may not be reproduced without permission.
