Yirendai PaaS Data Services Platform Genie: Technical Architecture and Features

 

Part I: Architecture and Components

1. The Evolution of Data Platforms

1.1 Background

With the advent of the data era, growth in data volume and complexity has driven the rapid development of data engineering. To meet diverse data acquisition and computation needs, many solutions have emerged in the industry, but most of them follow the same principles:

  • Reduce the cost of data processing

  • Improve the efficiency of data use and computation

  • Provide a unified programming paradigm

Yirendai's data services platform follows these three principles. I have personally experienced the entire development of Yirendai's data platform Genie, and looking across the industry, the development of Genie can fairly be called a microcosm of how data platforms have evolved everywhere.

Google's release of its three famous papers and the open-source Apache Hadoop ecosystem marked the point at which big data processing technology became accessible to "ordinary people". Hadoop components can run on ordinary low-cost machines, and the code is open source, so many companies favored it. So what did these companies start using it for?

The answer is a data warehouse.

Note: Google's three papers are Bigtable: A Distributed Storage System for Structured Data; The Google File System; and MapReduce: Simplified Data Processing on Large Clusters.

The early data warehouse architecture was typically built from three components: Sqoop + HDFS + Hive, because this was the cheapest and most efficient way to build one. At this stage the warehouse could only answer what happened in the past (the offline stage), because a t + 1 snapshot scheme with offline Sqoop extraction was generally used, which means only yesterday's data was available.
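The t + 1 snapshot scheme can be sketched in a few lines. This is a hypothetical illustration, not actual Sqoop usage: each nightly full extract lands in a partition keyed by business date, so the freshest data the warehouse can ever serve is yesterday's.

```python
from datetime import date, timedelta

# Illustrative t+1 snapshot store: partition date -> full table snapshot.
warehouse = {}

def nightly_extract(source_table, business_date):
    """Simulate a Sqoop-style full extract into a dated partition."""
    warehouse[business_date] = list(source_table)  # a full copy, not a diff

def latest_available(today):
    """The freshest partition a t+1 warehouse can serve is yesterday's."""
    return warehouse.get(today - timedelta(days=1))

source = [{"id": 1, "balance": 100}]
nightly_extract(source, date(2019, 6, 17))
print(latest_available(date(2019, 6, 18)))  # yesterday's snapshot
```

Note that each partition is a complete mirror of the source table, which is exactly why the history of an individual value is lost unless extra structures are built on top.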

Then, as demand for real-time data grew and complex operations such as joins and aggregations over real-time incremental data became necessary, a distributed stream computing framework was added to the platform, such as Storm, Flink, or Spark Streaming. At this stage the warehouse could answer what is happening now (the real-time stage).

Because the offline data flow (e.g. Sqoop + HDFS + Hive) and the real-time data flow (e.g. Binlog + Spark Streaming + HBase) were coupled into two large computation pipelines, and real-time analysis had to be supported by combining incremental data with the full historical data, architectures such as the early Lambda and Kappa emerged. With historical and real-time data combined, the warehouse could answer what will eventually happen (the prediction stage).
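The core Lambda idea can be shown in a minimal sketch (all names here are hypothetical): a batch view holds precomputed t + 1 aggregates, a speed view holds today's streaming increments, and a query merges both so that history and "right now" are answered together.

```python
# Hypothetical Lambda serving sketch: precomputed nightly totals plus
# today's streaming counts are merged at query time.
batch_view = {"2019-06-16": 120, "2019-06-17": 95}   # t+1 aggregates
speed_view = {"2019-06-18": 31}                       # streaming so far today

def total_events():
    """Serve a query by combining offline and real-time results."""
    return sum(batch_view.values()) + sum(speed_view.values())

print(total_events())  # 246
```

In a real deployment the merge happens in a serving layer over stores like HBase or Cassandra, but the principle is the same: the batch view is periodically recomputed and the speed view only ever covers the gap since the last batch run.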

By this point, the data platform is no longer something a data warehouse alone can describe; it works closely with the various business units (such as marketing, telemarketing, and operations) to build many data products. At this stage, the data platform has entered the decision-enabling stage.

In practice, the order in which companies reach the prediction and real-time stages differs; it is possible to make predictions using historical data alone.

1.2 Positioning the Data Platform

A data platform should be an important part of a company's infrastructure. Many Internet companies have followed the trend of building big data clusters, only to find it hard to extract real value from them; the most important reason is usually a mismatch between how the data is used and how the data platform is positioned. A data platform is currently positioned to serve the following purposes:

  • Enabling decision-making

The platform enables decision-making through BI reports that let management quickly understand how the company is operating, because the data does not lie.

  • Business data analysis / business data products

An ad-hoc query platform provides instant analysis, helping business analysts analyze quickly, locate problems quickly, and give feedback quickly.

  • Computing and storage

Business data products can also use the platform's computing and storage resources, for example recommendation and smart marketing products.

  • Efficiency

Improving data processing efficiency saves time in data mining and processing.

In the early days, most companies' personnel structure looked like the figure below:

Operations, marketing, and decision-makers use the platform most directly by viewing BI reports. Business analysts sort out business requirements and hand complete data warehouse requirements to the data warehouse engineers, who incorporate the new requirements into the existing company-level warehouse. The data engineering team is responsible for cluster operation and maintenance.

1.3 Shortcomings of the Initial Architecture

There is no need to describe this initial architecture in much detail; let us go straight to its shortcomings.

  • Decision-makers find that reports are always a beat behind, and there are always new requirements. The reason is simple: an Internet company's business is not as stable as that of traditional industries (such as banking or insurance), because Internet companies develop relatively fast and their business iterates quickly.

  • Business analysts always have various ad-hoc needs, for reasons similar to the first point.

  • The data warehouse engineers are exhausted. The warehouse is bulky, hard to operate, and inflexible; a change in one place tends to drag the whole thing along.

  • Cluster operation and maintenance is hard work, and jobs are too tightly coupled; for example, one business table failing to be produced directly affects all jobs across the company.

1.4 Common Solutions

I believe many companies have run into these thorny problems, and the solutions tend to be similar. Roughly:

  • Productize the platform by building a data services platform.

  • Shift the data warehouse team's energy to more fundamental underlying problems, such as data quality, standardized data usage, data security, and model architecture design.

  • Let business analysts use the platform directly to build their own business data marts, improving agility and specificity.

  • The data engineering team's main responsibility is no longer cluster operation and maintenance, but building the data services platform and business data products.

The advantages of this are:

  • It removes the data warehouse bottleneck.

  • The people most familiar with their own data build the data marts, which is more efficient.

  • Business data products can use the data services platform directly, improving efficiency and reducing the company's costs.

2. Architecture and Features of Yirendai's Data Platform Genie

2.1 Genie architecture

Yirendai is an Internet finance company; because of its financial attributes, its requirements for the platform's safety, stability, and data quality are higher than those of a typical Internet company. Yirendai's current total data volume is at the petabyte level, with daily increments at the terabyte level. Besides structured data, there are also logs, voice, and other data types. Data applications fall into two broad categories, operations and marketing, such as smart telemarketing and smart marketing. The data services platform must ensure that thousands of batch jobs run on time every day, guarantee the efficiency and accuracy of real-time computation for data products, and at the same time guarantee the performance of a large volume of daily ad-hoc queries.

The figure above shows the platform's underlying technical architecture. The overall architecture is Lambda. The Batch layer is responsible for t + 1 computation: the bulk of the scheduled jobs and the data processed for the warehouse, marts, and reports live in this layer. The Speed layer is responsible for real-time computation over incremental data: the real-time data warehouse, real-time incremental synchronization, and data products mainly use data from this layer. The Batch layer uses Sqoop to synchronize data to the HDFS cluster on a schedule, then computes with Hive and Spark SQL. For the Batch layer, stability matters more than computation speed, so our optimizations there focus on stability. The output of the Batch layer is the Batch view. The Speed layer has longer data pipelines than the Batch layer and a relatively more complex architecture.

DBus and Wormhole are CreditEase open-source projects, mainly used as data pipelines. DBus works by reading the database binlog to synchronize incremental data in real time; its main contribution is non-invasive incremental synchronization. There are of course other options, such as timestamp columns or triggers, which can also achieve incremental synchronization, but they put pressure on the business database and are too intrusive. Wormhole works by consuming the incremental data from DBus and synchronizing it to different storage systems, supporting both homogeneous and heterogeneous targets.
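The essence of this log-based change data capture can be illustrated with a small sketch (this is not the actual DBus/Wormhole API, just the principle): binlog-style change events are replayed against a sink store, so the sink converges to the source state without ever querying the business database directly.

```python
# Illustrative CDC apply loop: replay binlog-style events onto a sink.
def apply_event(sink, event):
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        sink[key] = event["row"]   # upsert: an update is just a newer row
    elif op == "delete":
        sink.pop(key, None)
    return sink

binlog = [
    {"op": "insert", "key": 1, "row": {"user": "a", "amount": 10}},
    {"op": "update", "key": 1, "row": {"user": "a", "amount": 25}},
    {"op": "insert", "key": 2, "row": {"user": "b", "amount": 5}},
    {"op": "delete", "key": 2},
]

sink = {}
for ev in binlog:
    apply_event(sink, ev)
print(sink)  # {1: {'user': 'a', 'amount': 25}}
```

Because the events are read from the replication log rather than the tables, the production database sees no extra query load, which is exactly the "non-invasive" property described above.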

Overall, the Speed layer synchronizes data into our various distributed databases, collectively called the Speed view. We then abstract the metadata of the Batch and Speed views into a unified layer called the Service layer, which provides unified external services through NDB. Data has two essential attributes, that is, data = when + what. In the time dimension, data is immutable once generated; inserts, updates, and deletes actually produce new data. In everyday use we often focus only on the "what" attribute, but only when + what together determine a datum's unique, immutable identity. We can therefore partition data along the time dimension: t + 1 data lives in the Batch view, t + 0 data in the Speed view. This is the intent of the standard Lambda architecture: separating offline from real-time computation. Our Lambda architecture is slightly different, though (not elaborated here).
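The "data = when + what" idea can be made concrete with a short sketch (names are illustrative): every change appends a new (when, what) pair instead of mutating anything in place, and the state at any time t is simply the last version at or before t.

```python
# Append-only history of (when, what) pairs: updates and deletes are
# modeled as new immutable facts, never in-place mutation.
history = []

def record(when, what):
    history.append((when, what))   # assumes events arrive in time order

def as_of(t):
    """Resolve 'what' as of a given time from the immutable history."""
    versions = [what for (when, what) in history if when <= t]
    return versions[-1] if versions else None

record(1, "created")
record(5, "approved")
record(9, "repaid")
print(as_of(6))  # 'approved' -- the state between t=5 and t=9
```

Under this view, the Batch and Speed views are just two slices of the same history along the time axis: t + 1 facts and t + 0 facts.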

Cluster resources are limited; in an architecture where offline and real-time computation share a cluster, resource preemption is unavoidable. Because every company's computing and storage solution may differ, I can only use our approach as an example here, in the hope that it serves as a useful reference.

To resolve the preemption problem, we first need a clear understanding of preemption. From the user dimension, if the platform is multi-tenant, preemption may exist between tenants; from the data architecture dimension, if offline and real-time computation are not deployed separately, preemption may also exist between them. It is worth emphasizing that preemption is not only about CPU and memory: network I/O and disk I/O can also be preempted.

The resource isolation offered by the open-source resource schedulers currently on the market, such as YARN and Mesos, is not very mature; they can only do limited isolation of CPU and memory (YARN in Hadoop 3.0 adds network and disk I/O isolation mechanisms). Because our jobs basically run "everything on YARN", we modified YARN; our modification, like the official solution, uses cgroups to achieve isolation. Co-located service processes should also be isolated with cgroups, for example when a DataNode and a NodeManager run on the same machine.

The data flow diagram gives a good description of the Genie data platform's composition and its data usage flow. On the usage flow: first, all data (structured and unstructured) is standardized in the data warehouse, for example unified units, unified dictionaries, unified data formats, unified naming, and so on. The standardized data is then used by the data marts directly, or indirectly as their input. Coupling between the data marts' businesses is very low, so their data coupling is also low, which nicely avoids company-wide job coupling. All business data applications use their own data marts directly.

2.2 Genie function modules

Beyond its composition, Genie as a whole is divided into seven subsystems.

  • Meta data: metadata management is the core of the core; the metadata service is the foundation of the data platform, and almost all of the platform's functions depend on it.

  • Authority: consolidated, unified permission management with flexible configuration. Permissions here include data access configuration.

  • Monitor: monitoring and statistics over cluster usage by tenant, and so on.

  • Triangle: a self-developed job scheduling system that is distributed, service-oriented, highly available, and user-friendly. The figure shows a schematic of the Triangle scheduling system. Overall it is a master/slave architecture; the "Job Runtime Dir" concept refers to a complete packaged environment needed by the currently running job, such as a Python environment.

  • Data Dev: the figure shows the data development process. The data development platform is a one-stop platform for development, testing, and release: safe, efficient, and supporting SQL, Python, and Spark Shell.

  • Data Pipeline: the data pipeline, used to manage pipeline configuration for both offline and real-time data. Complete offline and real-time warehousing can be configured within one minute.

  • Data Knowledge: data knowledge, used for data lineage queries and metric data management.

3. Summary

There is no best architecture, only the architecture that fits best. Every company's circumstances and business model are different; even though everyone does ETL, builds data warehouses, and does machine learning, how much demand does the warehouse really face? What are the machine learning scenarios? How real-time must the ETL be? These details are constrained by many complicated objective conditions.

Two factors are critical when choosing a technical architecture: scenario and cost. Simply put, achieve what the scenario requires in a cost-effective way, without over-design. If the scenario is complex, it can be abstracted and segmented along multiple dimensions, for example the time dimension (problems to be solved historically, currently, and possibly in the future). Likewise, cost has many dimensions worth considering, such as development cycle, operation and maintenance complexity, stability, and the existing staff's technology stack.

In the next part, we will continue interpreting Yirendai's PaaS Data Services Platform Genie from two angles: "real-time data warehouse technical details" and "data platform features". Stay tuned.

Part II: Technical Details and Features

REVIEW: In Part I we took a quick look at the features of Yirendai's data platform Genie and learned something about the development history of data platforms. In this second part, we first focus on the technical details of the real-time data warehouse, then introduce the data platform's features. Let's take a look.

4. Real-Time Data Warehouse: Technical Details

A traditional data warehouse is offline and t + 1: the timeliness of data processing is the previous day's data. The usual strategy for offline synchronization is a scheduled sync once a day, and it is basically a full-volume sync, i.e. a mirror of the business database's complete data for that day.

Besides timeliness, another issue is that the data is only a mirror of the current state, so to know the history of changes to a value we need to build a zipper table, which is very time- and resource-consuming. There are many implementations of real-time data warehouses, but most of them amount to the same thing.
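A zipper table can be sketched as follows (a hedged illustration with made-up names, not our production code): daily full snapshots are folded into rows with [start, end) validity, so a value's change history can be queried without scanning every snapshot.

```python
from datetime import date

OPEN = date(9999, 12, 31)  # sentinel meaning "still the current version"

def merge_snapshot(zipper, snapshot, snap_date):
    """Close changed rows and open new versions for one day's snapshot."""
    for key, value in snapshot.items():
        current = [r for r in zipper if r["key"] == key and r["end"] == OPEN]
        if current and current[0]["value"] == value:
            continue                       # unchanged: keep the open row
        if current:
            current[0]["end"] = snap_date  # close the superseded version
        zipper.append({"key": key, "value": value,
                       "start": snap_date, "end": OPEN})
    return zipper

z = []
merge_snapshot(z, {"loan1": "pending"}, date(2019, 6, 16))
merge_snapshot(z, {"loan1": "approved"}, date(2019, 6, 17))
print(z)  # two rows: a closed 'pending' version and an open 'approved' one
```

The cost is visible even in this toy: every snapshot must be compared against the open rows, which is why building zipper tables over large warehouses is expensive.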

A real-time data warehouse needs two characteristics: first, access to real-time data; second, the ability to return results in real time, or approximately so. Of course, if the offline warehouse is well optimized, the second point is achievable there too. Two questions follow: why use real-time data, and why have a real-time data warehouse at all?

In recent years, data engineers have made many efforts and attempts to improve data timeliness. What drives this push toward real-time data is, of course, both the development of the technology and the demands of the scenarios. In China's Internet environment, competition is fierce, and improving conversion rates becomes ever more critical.

User profiling, recommendation systems, funnel analysis, intelligent marketing, and related data products are all inseparable from real-time data processing.

The most direct way to obtain real-time data is to connect directly to the business database. The advantage is obvious, but so are the disadvantages: some logic requires cross-database joins over multiple sources at once, and for that a direct connection to the business database simply does not work. So you first need to synchronize data from multiple sources into one place, and that synchronization process is a real challenge: data timeliness, intrusiveness to the business systems, data security, data consistency, and many other problems must be considered.

So we need a data synchronization tool with the following characteristics:

  • Synchronizes production database data and log data in near real time
  • Is completely decoupled from the production databases and application servers
  • Can distribute the synchronized data to other storage systems
  • Guarantees no data loss during synchronization, and supports replaying batches of data from any point in time

DBus and Wormhole, developed by CreditEase's agile big data team, satisfy all four points above.

DBus extracts data using the database binlog; binlog latency is generally low, which guarantees real-time behavior while also guaranteeing zero intrusion into the production database.

In fact, using logs to build robust data systems is a very common pattern. HBase uses a WAL to ensure reliability, MySQL master-slave replication uses the binlog, the Raft distributed consensus algorithm uses a log to ensure consistency, and Apache Kafka is likewise built on the log.

DBus makes good use of the database binlog, performs a unified schema transformation, and forms its own logging standard to support multiple data sources. DBus is positioned as a commercial-grade data bus system. It can extract data from a data source to Kafka in real time.

Wormhole is responsible for writing the data onward into other storage systems. Kafka thus becomes a data bus in the true sense: Wormhole's sink side can start consuming from Kafka at any point in time, which also enables excellent data replay.
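Why an append-only log enables this replay can be shown in a few lines (an illustrative sketch, not the real DBus/Wormhole/Kafka APIs): a consumer can re-read from offset 0 and deterministically rebuild state, or resume from a checkpointed offset and converge to the same result.

```python
# A toy append-only log of (op, key, value) entries, standing in for a
# Kafka topic partition.
log = [("set", "a", 1), ("set", "b", 2), ("set", "a", 3)]

def replay(from_offset, initial=None):
    """Fold log entries from an offset onto an initial state."""
    state = dict(initial or {})
    for op, key, value in log[from_offset:]:
        if op == "set":
            state[key] = value
    return state

full = replay(0)                               # re-sync from scratch
resumed = replay(2, initial={"a": 1, "b": 2})  # resume from a checkpoint
print(full == resumed, full)  # True {'a': 3, 'b': 2}
```

This determinism is what lets a sink be rebuilt, backfilled, or repaired at any time without touching the production database.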

Genie's real-time architecture is as follows:

With DBus and Wormhole we can easily synchronize data from the production databases in real time into our Cassandra cluster, then connect Presto to provide users with SQL-based computation.

Through this simple architecture we efficiently built the real-time data warehouse and delivered the company's real-time reporting platform and several real-time marketing data products.

As for why we chose Presto, the answers are:

  • Presto offers interactive-level query latency

  • Presto supports horizontal scaling (Presto on YARN via Slider)

  • It supports standard SQL and is easy to extend

  • It is used in production at Facebook, Uber, and Netflix

  • It is written in Java and open source, matching our team's technology stack, and supports custom functions

  • It supports joins across multiple data sources with logical join pushdown; Presto can connect to Cassandra, HDFS, and more

  • Pipelined execution reduces unnecessary I/O overhead

Presto uses a master/slave architecture; I will not go into the overall details here. Presto has a data storage abstraction layer that supports SQL computation over different data stores. Presto provides a metadata API, a data location API, and a data stream API, and supports developing your own pluggable connectors.

In our scenario it is Presto on Cassandra, because Cassandra offers better availability than HBase and suits ad-hoc query scenarios better. In CAP terms, HBase leans toward C (consistency) while Cassandra leans toward A (availability). Cassandra is a very good, easy-to-use database; under the hood it uses a Log-Structured Merge-Tree as the core data structure for storage and indexing.
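The Log-Structured Merge-Tree idea can be sketched in miniature (greatly simplified versus Cassandra's real implementation, with illustrative names): writes go to an in-memory memtable, full memtables are flushed as immutable sorted runs (SSTables), and reads check the newest data first.

```python
# Toy LSM store: an in-memory memtable plus flushed, immutable SSTables.
MEMTABLE_LIMIT = 2

memtable = {}
sstables = []  # newest last; each is an immutable sorted list of pairs

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:        # flush when full
        sstables.append(sorted(memtable.items()))
        memtable.clear()

def get(key):
    if key in memtable:                        # freshest data first
        return memtable[key]
    for table in reversed(sstables):           # then newer SSTables
        for k, v in table:
            if k == key:
                return v
    return None

put("a", 1)
put("b", 2)   # triggers a flush
put("a", 9)   # a newer version shadows the flushed one
print(get("a"), get("b"), len(sstables))  # 9 2 1
```

The design choice this illustrates is that all writes are sequential appends, which is why LSM-based stores like Cassandra sustain very high write throughput; the price is read amplification across SSTables, mitigated in real systems by compaction and Bloom filters.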

5. Overall Data Processing Architecture

Having introduced Yirendai's real-time data processing architecture, let's look at the overall data processing architecture.

The overall architecture is Lambda, with DBus and Wormhole forming the Speed layer's real-time data bus; the Speed layer directly supports real-time data products. The DataLake is an abstract concept; to implement it we mainly use HDFS + Cassandra for storage, with Hive and Presto as the main compute engines, and we provide integrated metadata through a unified metadata platform, thereby achieving a complete DataLake. The DataLake's main scenarios are advanced, flexible analysis and query workloads such as ML.

The difference between a DataLake and a data warehouse is that a DataLake is more agile and flexible and focuses on data acquisition, while a data warehouse focuses on standards, management, security, and fast indexing.

6. Genie Data Platform Function Modules

The entire Genie data services platform consists of seven major sub-platform modules:

  • Data query

  • Data knowledge

  • Real-time reporting

  • Data development

  • Job scheduling

  • Permission management

  • Cluster monitoring and management

Below we introduce a few of these modules.

6.1 Data Query module

  • Users can query the data warehouse, the data marts, and the real-time data warehouse

  • Fine-grained permission management implemented by parsing SQL

  • Multiple query engines are offered

  • Data export

6.2 Data Knowledge Module

  • Metadata monitoring and management

  • Provides query and management of metadata across the company

  • Monitors metadata changes and sends warning messages

  • Lineage analysis query engine

  • SQL analysis engine

  • Analyzes all jobs / warehouses / tables / fields

  • Provides lineage analysis / impact analysis

6.3 Data Reporting Module

  • Real-time data warehouse

  • Direct connection via Presto on Cassandra

  • Hundreds of tables synchronized in real time (DBus + Wormhole)

  • Da Vinci reporting platform (DaVinci)

  • Nearly a hundred reports in use

6.4 Data Development Module

  • Genie-ide for data programming

  • Genie-ide provides data development capabilities

  • Provides network-disk storage for scripts

  • Real-time testing / release

  • Data Pipeline

    • One-click offline warehousing

    • One-click real-time warehousing

6.5 Triangle Scheduling Module

  • Microservice architecture design: each module is a service

  • Provides RESTful interfaces for easy secondary development and integration with other platforms

  • Provides a management console with job health monitoring

  • Provides public and private jobs

  • Logical isolation between workflows

  • Concurrency control and failure policy management

7. What the Genie Data Platform Can Do

The above introduced the Genie data platform's function modules; what specifically can the Genie platform do?

First, offline and real-time warehousing (for data warehouses and data marts) can be configured within one minute;

Second, real-time warehousing can be configured directly into real-time report display and push (BI analysis);

Third, it supports multiple ways of consuming real-time data, with built-in permission control for safety: API, Kafka, JDBC (for business data products);

Fourth, it offers one-stop data development supporting Hive, Spark SQL, Presto on Cassandra, and Python (data development);

Fifth, its service-oriented scheduling system supports access by external systems (as a basic technology component).

References:

https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/

http://thesecretlivesofdata.com/raft/

https://engineering.linkedin.com/data-replication/open-sourcing-databus-linkedins-low-latency-change-data-capture-system

https://yq.aliyun.com/articles/195388

https://www.cnblogs.com/tgzhu/p/6033373.html

Author: SUN Zhe

Source: CreditEase Institute of Technology
