TOP100summit [Shared Record]: How Lianjia.com Built Its Big Data Platform System

This article is based on a 2016 TOP100summit case share by Bruce Lee, a senior R&D architect in the Big Data Department of Lianjia.com.

Editor: Cynthia

Bruce Lee: Senior R&D architect in the Big Data Department of Lianjia.com, responsible for building the department's big data tools into a platform, with a focus on data warehousing, task-flow scheduling, metadata management, and self-service reporting. Before that, he spent four years at Baidu developing data warehouses and tooling platforms.

Introduction: The big data department of Lianjia.com is responsible for collecting and processing data from each of the company's product lines and providing data support to every business unit of Lianjia Group. This article shares the problems and challenges the department ran into as its architecture evolved after its founding, focusing on the team's transformation from an early provider of data reports into today's data platform team. By reworking the data processing pipeline, the team built an integrated open platform for data ingestion, computation, and presentation, improving the efficiency of data operations and quickly meeting data needs across the group.

 

1. Background

Since its founding in 2014, Lianjia.com has pursued an O2O strategy across the board, building a closed loop between online and offline real estate services. The business has grown rapidly, covering 28 regions nationwide with more than 8,000 stores. As the data accumulated by Lianjia Group kept growing, a big data department was established in 2015 to consolidate the data assets of the group's companies and drive the company's business with data.

Lianjia divides real estate transaction big data into three parts for study: property data, people data, and behavior data.

● Property data is mainly about building a nationwide real estate dictionary: a professional photogrammetry team conducts on-site surveys and has collected detailed information on 70 million homes, including the surrounding neighborhood, community character, and so on.

● People data covers buyers, owners, and agents. There are currently 130,000 agents nationwide, with detailed records of each agent's background, years of experience, qualifications, professional skills, and historical behavior, giving customers a more reliable reference. Lianjia.com currently serves more than 20 million buyers and sellers; profiling these users makes it possible to recommend better-matched homes.

● Behavior data covers online behavior and various offline behaviors, such as online browsing logs and offline viewing appointments.

By analyzing these data we can find where they connect with the business. Current applications of big data at Lianjia.com include house valuation, intelligent recommendation, the tenant map, and BI reports.

2. Implementing the Big Data Architecture from 0 to 1

After the big data department was established, we drew on mature data warehouse solutions from the industry; the early architecture we designed is shown in Figure 1:

Figure 1 Early architecture of the data warehouse

At this stage we mainly did three things:

● Build a Hadoop cluster. At the start there were only a dozen or so machines; as the business developed, the cluster kept growing.

● Use Hive to build the data warehouse. The warehouse's data came from the business side's MySQL databases and application logs.

● Develop custom reports: build BI reports case by case according to business needs, to support the business side's data analysis.

This architecture is simple and clear, and it brought three benefits:

● Open source components make scaling, operations, and maintenance easy;

● It adopts the industry's mature data warehouse approach, with a layered warehouse model;

● It makes it easier to grow the technical team. In the early stage, the team had to consider the talent available in the market, so widely adopted technologies were chosen.

Specifically, the Hive data warehouse model is divided into five layers. Among them:

● The bottom layer is the STG layer, which stores the raw source data, kept fully consistent with the data source;

● The second layer is the ODS layer, where data cleaning and similar work happens. For example, city codes differ across business systems: in one system 001 means Beijing, in another 110 does; the ODS layer unifies these dimension encodings. Monetary units also differ across systems (some use yuan, some use fen); fen is adopted uniformly here, with two decimal places retained. A sketch of this kind of cleaning follows the list;

● The top layer is the report layer, where data is processed according to business requirements into report output. As for which normal forms the warehouse follows, there is currently no strict, uniform rule; both star and snowflake schemas are used.
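As an illustration of the ODS-layer cleaning described above, here is a minimal PySpark sketch; the table and column names (stg.orders, city_code, amount_yuan) are hypothetical, not Lianjia's actual schema:

```python
# Minimal sketch of ODS-layer cleaning: unify dimension encodings and
# standardize monetary units. All table/column names are hypothetical.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Both source systems' codes for Beijing collapse into one unified code.
city_code_map = {"001": "110000", "110": "110000"}
map_expr = F.create_map(*[F.lit(x) for kv in city_code_map.items() for x in kv])

ods = (
    spark.table("stg.orders")
    # Unified dimension encoding across business systems.
    .withColumn("city_code", map_expr[F.col("city_code")])
    # Money standardized on fen, two decimal places retained.
    .withColumn("amount_fen", F.round(F.col("amount_yuan") * 100, 2))
)
ods.write.mode("overwrite").saveAsTable("ods.orders")
```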

The early big data architecture served us for nearly a year after it went live, from early 2015 to early 2016, and achieved good results:

● It collected and consolidated the data of each subsidiary and product line within the group, making cross-cutting analysis convenient. Comparative analysis of the data helped the business systems develop better.

● It supported most reporting needs within the group, helping operations staff make better, data-driven decisions. As the saying goes, even the cleverest cook cannot make a meal without rice: once a large amount of historical data had accumulated in the warehouse, data mining engineers could dig into it in depth.

3. Building the Big Data Platform System

Why build a platform?

The main reason is that as the company's business grew rapidly, demand for data surged, and the early big data architecture ran into new challenges.

● Data demand grew rapidly: Lianjia's business expanded to many cities nationwide. Each city has many reporting needs, and since local policies differ, so do the reports; on top of that there is a large volume of ad hoc statistical requests. To respond quickly, we proposed platformization: by providing a range of data processing and exploration tools, users can efficiently fetch much of the data themselves.

● Data governance needed standardization: after each product line's data entered the warehouse, some modeling specifications were not enforced under demand pressure, leaving the warehouse with redundant, tangled data; the wiki was not updated or maintained in time, so it was hard to get a clear overview of what the warehouse held. Metric definitions were also unclear: some data consumers defined metrics according to their own understanding, and only after results went live did it emerge that different people understood the metrics differently, causing rework.

● Data security became urgent: data access requests needed centralized approval and management, and data usage needed continuous tracking and auditing to prevent leaks.

To solve these problems, we proposed a new platform-based architecture. Its data flow is shown in Figure 2:

 

Figure 2 Platform architecture data flow diagram

 

Comparing the old and new architecture diagrams, the first difference is the real-time data stream shown in red. Logs are collected by Flume into a Kafka message queue, parsed and processed with Spark Streaming/Storm, and the results are written to HBase for consumption by API services.
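A rough sketch of that real-time path, assuming Spark Streaming's Kafka direct stream and the happybase HBase client; the topic, broker address, and HBase table layout are all made up for illustration:

```python
# Consume logs from Kafka with Spark Streaming and write parsed results to
# HBase for API services to read. Names and addresses are hypothetical.
import json
import happybase
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="log-stream")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc, topics=["app_logs"],
    kafkaParams={"metadata.broker.list": "kafka01:9092"})

def save_partition(records):
    # One HBase connection per partition; table "log_stats" with column
    # family "d" is assumed to exist.
    conn = happybase.Connection("hbase01")
    table = conn.table("log_stats")
    for _key, value in records:
        event = json.loads(value)
        table.put(event["id"], {b"d:payload": value.encode("utf-8")})
    conn.close()

stream.foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))
ssc.start()
ssc.awaitTermination()
```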

In addition, on the OLAP side, Kylin was introduced as the MOLAP engine. Star-schema data in Hive is pre-computed and written to HBase on a schedule, and Kylin then serves analysis queries externally at sub-second speed.
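Kylin answers standard SQL over a REST endpoint; a query against a pre-built cube might look like the following (host, credentials, project, and table names are placeholders):

```python
# Query Kylin's SQL endpoint over REST; sub-second responses come from the
# pre-computed cube stored in HBase. All names below are placeholders.
import requests

resp = requests.post(
    "http://kylin01:7070/kylin/api/query",
    auth=("ADMIN", "KYLIN"),  # Kylin's default credentials
    json={
        "sql": "SELECT city, COUNT(*) AS deals FROM fact_deals GROUP BY city",
        "project": "lianjia_dw",
    },
)
print(resp.json()["results"])
```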

On the right side of the figure are the data governance components: data permissions, data quality, and metadata. In the new platform architecture, the big data engineering platform is divided into three layers, from top to bottom: the application layer, the tool layer, and the base layer, as shown in Figure 3:

 

Figure 3 Big data engineering platform

3.1 Application layer

The application layer mainly serves data developers and data analysts and focuses on solving three kinds of problems:

● How to speed up the delivery of BI reports, shortening the cycle from a business data request to the finished report.

● Data governance: unified definitions of the company's core metrics, metadata management, and a centralized data approval process.

● Centralized management and control of data flow: data moving between systems is registered uniformly in the metadata management platform, which makes problems easy to trace and pinpoint.

To speed up BI report delivery, we developed a self-service reporting tool (its name translates roughly as "Seismograph"). Once the data source is ready, the goal is to finish configuring a typical report in 5 minutes and get chart output comparable to an Excel table or bar chart. MySQL, Presto, Kylin, and other data sources are currently supported. For customized dashboards, the self-service tool also supports reusing chart components.

Metadata management system

The metadata system manages and maintains information about all of the company's data. Through the data map, users can see everything in the company's data warehouse and how it changes over time, which makes search and discovery convenient. The metric library defines each metric in detail, manages the dimensions associated with it, and describes dimension tables and dimension values. On top of the metadata we also build data lineage, which makes it easy to trace a dataset's upstream and downstream relationships and to pinpoint and troubleshoot problems quickly.
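Conceptually, lineage is a directed graph over tables, and the two questions it answers are "what feeds this?" and "what does this feed?". A toy sketch with made-up table names:

```python
# Toy data-lineage graph; in practice the edges come from parsing job
# definitions registered in the metadata system. Table names are made up.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("stg.orders", "ods.orders"),
    ("ods.orders", "dw.fact_deals"),
    ("dw.fact_deals", "rpt.daily_gmv"),
])

# Upstream dependencies of a report, for impact analysis and repair.
print(nx.ancestors(lineage, "rpt.daily_gmv"))
# Downstream consumers of a source, for change notification.
print(nx.descendants(lineage, "stg.orders"))
```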

After the metadata management system launched, it delivered three results:

● All tables are created and modified through the metadata system, giving a clear, real-time picture of the data in the warehouse.

● A company-level data committee was established to define the company's core metrics uniformly; each department can define its own secondary metrics.

● Data inflow and outflow are centrally managed through the metadata system. All log ingestion and MySQL ingestion are configured via metadata, and data access requests are metadata-based as well, which makes centralized management and maintenance easy.

3.2 Tool layer

The tool layer is positioned to build general-purpose tool components that support the upper-layer applications while solving the practical difficulties users run into with big data computation. For example: there are many ETL jobs, so tracking down failures is tedious and data recovery takes a long time; and Hive queries through Hue are slow, with a single SQL statement taking minutes.

Figure 4 is a typical data task chain from real work, showing an excerpt of a larger job chain.

 

Figure 4 Data task chain diagram

From the figure we can see the following:

● The task chain is very long, with as many as six levels;

● There are many task types, including MySQL import tasks, Hive SQL processing tasks, and email delivery tasks;

● The dependency types are complex: there are hourly tasks that depend on minute-level tasks, and tasks that depend on one another.

For complex data chains like this we previously relied on a combination of Oozie, Python, and shell scripts. With more than 5,000 tasks, maintenance was difficult, and when data had to be repaired, the fault was hard to locate quickly. To solve these problems, we built our own task scheduling system, drawing on open source schedulers such as Oozie and Airflow.
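Our scheduler is homegrown, but Airflow, one of the systems we referenced, can express the same kind of mixed-type chain shown in Figure 4. A minimal sketch (commands, addresses, and the DAG itself are illustrative only):

```python
# A MySQL-import -> Hive-ETL -> email chain, written against Airflow 1.x
# operators for illustration; this is not our in-house scheduler's API.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.email_operator import EmailOperator

dag = DAG("daily_deal_report", start_date=datetime(2016, 1, 1),
          schedule_interval="@daily")

mysql_import = BashOperator(task_id="mysql_import", dag=dag,
                            bash_command="sqoop import ...")  # placeholder args
hive_etl = BashOperator(task_id="hive_etl", dag=dag,
                        bash_command="hive -f daily_deal_etl.sql")
send_mail = EmailOperator(task_id="send_mail", dag=dag,
                          to="biz-ops@example.com",
                          subject="daily deal report",
                          html_content="report is ready")

mysql_import >> hive_etl >> send_mail
```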

On the new task scheduling system, users can operate and maintain their own jobs, bring tasks online or rerun them, and watch task logs in real time. Previously this meant the hassle of logging in to cluster machines to read the logs.

After the scheduling system went online, it achieved very good results:

● Task configuration is simple; jobs can be set up by dragging and dropping on a graph.

● Common ETL components require zero coding. For example, sending a data email used to mean writing your own script; now you only need to configure the recipients and data tables in our interface.

● One-click repair with lineage tracing cut the time to troubleshoot and repair data from one person-day to about 10 minutes.

● Cluster resources are always tight, so we are working on intelligent scheduling and off-peak execution to ensure that high-priority tasks run first.

For ad hoc queries, the Hue setup we used before was slow. After surveying the fast query tools on the market, we adopted a dual engine of Presto and Spark SQL. The architecture is shown in Figure 5, with a client-side sketch after it:

 

Figure 5 Dual-engine architecture diagram
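From the client's point of view, both engines can be reached with PyHive, since the Spark SQL Thrift server speaks the HiveServer2 protocol; the hosts, ports, and table below are assumptions:

```python
# Run the same ad hoc SQL against both engines of the dual-engine setup.
# Host names, ports, and the table are hypothetical.
from pyhive import hive, presto

sql = "SELECT city, COUNT(*) FROM dw.fact_deals GROUP BY city"

# Presto: the low-latency interactive path.
cur = presto.connect(host="presto-coordinator", port=8080).cursor()
cur.execute(sql)
print(cur.fetchall())

# Spark SQL, reached through its Thrift server (HiveServer2-compatible).
cur = hive.connect(host="spark-thriftserver", port=10000).cursor()
cur.execute(sql)
print(cur.fetchall())
```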

3.3 Base Layer

The base layer focuses on building out and improving the cluster's underlying capabilities. The problems here center on two areas:

● Task volume has grown sharply, to more than 10,000 jobs per day, which makes cluster resources very tight and queuing severe.

● Cluster data security needed planning: multiple departments use the cluster, but accounts and queues had never been separated, and everyone shared everything.

In response to these problems, we made several improvements at the base layer.

For cluster performance, we set up separate account queues and reserved resources to guarantee the execution of core jobs, and integrated with the application layer's permission management so that different directories are restricted according to user ownership. As cluster data grew, a lot of cold data sat unmanaged; after an audit, we migrated the cold data to AWS S3 storage. A sketch of such a migration step follows.
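A sketch under stated assumptions: cold partitions are identified from access metadata, copied to S3 through the S3A connector with distcp, and then removed from HDFS. The paths, nameservice, bucket, and selection logic are hypothetical:

```python
# Migrate cold HDFS partitions to AWS S3 and free cluster storage.
# The path list, nameservice, and bucket are placeholders.
import subprocess

cold_paths = ["/warehouse/dw.db/fact_deals/dt=2015-01-01"]  # from a metadata scan

for path in cold_paths:
    subprocess.check_call([
        "hadoop", "distcp",
        "hdfs://nameservice1" + path,
        "s3a://lianjia-cold-data" + path,
    ])
    subprocess.check_call(["hdfs", "dfs", "-rm", "-r", "-skipTrash", path])
```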

4. Case Takeaways

● How can a traditional enterprise or startup team adopt big data quickly? First, use mature industry solutions: the practices of large Internet companies can be borrowed directly, and stable open source software can be used as-is. Second, thoroughly understand the company's business and find where data fits, such as Lianjia's online home valuation, personalized search, and cross-cutting report analysis.

● In the face of fast business growth, platform thinking is a powerful tool. First, understand users' workflows and habits and automate those services so users can solve common problems themselves; second, a platform product must first solve its direct users' pain points, because only when they are willing to use it can it be promoted to others.

● Platform products require clearly defined processes and standards. First sort out the company's current state, then standardize the process; the sorting-out is painful and needs many people's cooperation. Once a standard is set, its authority and enforcement must be guaranteed; a company-level data governance committee can be established to publish the core metrics and ensure the process is adopted and followed.

 

The schedule for the 6th TOP100 Conference has been finalized. For more TOP100 case information and the full schedule, please visit the [official website]. Over four days, tracks covering product, team, architecture, operations, big data, artificial intelligence, and more will share the 100 most instructive R&D case studies of 2017 from home and abroad. Ten single-day free passes to the opening ceremony will be given away on this platform, first come, first served. Free trial ticket application entrance
