The Practice of JuiceFS in the Dasouche Data Platform

Souche has built a relatively complete Internet collaboration ecosystem for the automotive industry. This ecosystem covers not only 90% of the country's large and medium-sized used-car dealers, the 9,000+ 4S stores, and the 70,000+ secondary new-car dealer outlets that Dasouche has digitized, but also companies with strong industry-chain service capabilities under Dasouche, such as Cheyipai, Chexing 168, Car Delivery Manager, and Brexo; automakers such as Great Wall Motor, Changan Automobile, and Infiniti that have reached in-depth strategic cooperation with Dasouche on new retail solutions; and other upstream and downstream partners in the industry chain, such as CNPC Kunlun Hospitality. On top of this ecosystem, Souche digitizes every link in the automobile circulation chain, thereby empowering the entire industry.

When it comes to big data, every company is familiar with the stack: HDFS for storage, YARN for compute resource management, Hive, Spark, and Spark SQL for offline computing, HBase as the columnar store, and Spark Streaming and Flink for real-time computing. These components are relatively easy to maintain while the cluster is stable, but when a company is developing rapidly, rapid growth in cluster capacity is inevitable. As designers of the big data platform, we have to weigh the cluster's cost against its benefit.

Status Quo of Big Data Clusters

Dasouche's big data clusters are currently split into an offline computing cluster and a real-time computing cluster. Offline computing is based on Hive and Spark, and real-time computing is based on Flink. The two clusters are managed with HDP and CDH respectively: HDP was chosen for offline computing in the early days, and CDH was later chosen for real-time computing because of its convenient multi-cluster management. Since the two distributions differ in their offline computing engines, migrating between them would raise compatibility problems, so the two clusters have always coexisted, with resources completely isolated between them.

Cluster Maintenance Pain Points

  1. The amount of data keeps growing, and cluster expansion is time-consuming, labor-intensive, and costly

From the beginning of 2018 to June 2019, the offline cluster grew from the initial dozens of nodes to hundreds of nodes, while the data volume grew more than tenfold from the initial tens of TiB and kept increasing at a rate of TiBs per day. To hold down spending, the cluster was expanded once a month, a race against the speed of data growth. The fixed monthly routine effectively became: receive disk alarms, expand capacity, balance data, and balance data again. In some extreme cases, for example when Alibaba Cloud had no big-data-type instances left in an availability zone and new nodes had to be created in another availability zone, the data network segment had to change as well, which was even more complicated.
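As an illustration of what triggers that routine, here is a minimal sketch that polls the NameNode's JMX endpoint and computes the disk-utilization spread across DataNodes, the same signal the HDFS balancer's threshold acts on. The NameNode address (50070 is the Hadoop 2.x default HTTP port) and the 10% threshold are assumptions for the example, not our production values.

```python
import json
import urllib.request

# Hypothetical NameNode address; 50070 is the Hadoop 2.x default HTTP port.
JMX_URL = "http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"

def datanode_utilization():
    """Return {datanode: used/capacity} parsed from the LiveNodes JMX attribute."""
    with urllib.request.urlopen(JMX_URL) as resp:
        bean = json.load(resp)["beans"][0]
    live_nodes = json.loads(bean["LiveNodes"])  # LiveNodes is a JSON string
    return {
        name: stats["usedSpace"] / stats["capacity"]
        for name, stats in live_nodes.items()
    }

util = datanode_utilization()
spread = max(util.values()) - min(util.values())
print(f"utilization spread across {len(util)} DataNodes: {spread:.1%}")
if spread > 0.10:  # mirrors `hdfs balancer -threshold 10`
    print("spread above threshold: time to rebalance or expand")
```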

  2. Storage requirements grow out of sync with computing requirements

While analyzing the offline cluster's data, we found that hot data accounted for only about 20% of the total. As the cluster keeps expanding, computing resources become redundant, incurring unnecessary cost. In addition, every rebalance occupies node network bandwidth and slows down the tasks that are reading and writing data.
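One way to estimate a hot-data ratio like this is to walk the warehouse through WebHDFS and sum the bytes accessed recently; a minimal sketch follows. The endpoint, path, and 30-day window are assumptions, and accessTime is only maintained when dfs.namenode.accesstime.precision is enabled on the cluster.

```python
import json
import time
import urllib.request

# Hypothetical WebHDFS endpoint and warehouse path.
WEBHDFS = "http://namenode:50070/webhdfs/v1"
ROOT = "/user/hive/warehouse"
WINDOW_MS = 30 * 24 * 3600 * 1000  # "hot" = accessed within the last 30 days

def list_status(path):
    with urllib.request.urlopen(f"{WEBHDFS}{path}?op=LISTSTATUS") as resp:
        return json.load(resp)["FileStatuses"]["FileStatus"]

def hot_and_total_bytes(path, now_ms):
    """Recursively sum total bytes and bytes accessed within the window."""
    hot = total = 0
    for st in list_status(path):
        if st["type"] == "DIRECTORY":
            h, t = hot_and_total_bytes(f"{path}/{st['pathSuffix']}", now_ms)
            hot, total = hot + h, total + t
        else:
            total += st["length"]
            if now_ms - st["accessTime"] <= WINDOW_MS:
                hot += st["length"]
    return hot, total

hot, total = hot_and_total_bytes(ROOT, int(time.time() * 1000))
print(f"hot data ratio: {hot / total:.1%}" if total else "no data found")
```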

  3. Synchronizing data across clusters

To reduce interference between real-time and offline tasks, to make resource control easier, and to maximize the value of the chosen cloud resources, the real-time and offline computing clusters are physically separated. This brings its own difficulty: data cannot be synchronized between the real-time and offline clusters in real time, so some requirements cannot be met.

  4. NameNode memory keeps growing, and restarts take too long

In file storage, an excessive number of files causes the NameNode's management memory to keep growing, and a single restart takes so long that it inevitably affects data synchronization. Moreover, if the data life cycle is not strictly controlled at the data warehouse level, resource usage keeps climbing as well, which also gets in the way of analyzing resource usage across the whole cluster.
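To make the file-count pressure concrete, this sketch reads FilesTotal and BlocksTotal from the NameNode's FSNamesystem JMX bean and applies the common rule of thumb that every file, directory, and block object costs roughly 150 bytes of NameNode heap. The address is a placeholder and 150 bytes is a heuristic, not an exact figure.

```python
import json
import urllib.request

# Hypothetical NameNode address; FSNamesystem exposes FilesTotal / BlocksTotal.
JMX_URL = "http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"

with urllib.request.urlopen(JMX_URL) as resp:
    bean = json.load(resp)["beans"][0]

files, blocks = bean["FilesTotal"], bean["BlocksTotal"]
# Rule of thumb: ~150 bytes of NameNode heap per file/directory/block object.
heap_gib = (files + blocks) * 150 / 1024**3
print(f"{files:,} files, {blocks:,} blocks -> roughly {heap_gib:.1f} GiB of heap")
```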

Choosing JuiceFS

Given the problems above, it became imperative to choose a new product for the underlying storage. As the cornerstone of the big data platform, the storage layer needs to meet the following requirements:

  • Compatible with Hadoop framework protocol
  • Multi-version cluster compatibility
  • High throughput, low latency
  • Support deep compression to reduce resource usage

At first we tried Alibaba Cloud OSS as cold-backup storage, but during testing we found that data maintenance was limited because OSS provides no metadata management. Later we came across JuiceFS, which met all of the selection criteria above, and we ran some performance tests against it (all based on business logic extracted from real scenarios).
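Before benchmarking, a quick way to verify the Hadoop-protocol compatibility is to point Spark at a JuiceFS volume through the juicefs-hadoop SDK and round-trip some data. This is a minimal sketch assuming the SDK jar is already on the driver and executor classpath; the volume name, Redis metadata address, and paths are placeholders, not our production settings.

```python
from pyspark.sql import SparkSession

# Assumes the juicefs-hadoop jar is on the Spark classpath.
spark = (
    SparkSession.builder
    .appName("juicefs-smoke-test")
    .config("spark.hadoop.fs.jfs.impl", "io.juicefs.JuiceFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.jfs.impl", "io.juicefs.JuiceFS")
    # Placeholder metadata-engine address for a volume named "myjfs".
    .config("spark.hadoop.juicefs.meta", "redis://redis-host:6379/1")
    .getOrCreate()
)

# Round-trip a small DataFrame through the jfs:// scheme.
spark.range(1000).write.mode("overwrite").parquet("jfs://myjfs/tmp/smoke_test")
print(spark.read.parquet("jfs://myjfs/tmp/smoke_test").count())  # expect 1000
```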

Performance Tests on Real Scenarios

All of the tests below use real business data, with the data size varied through different WHERE conditions; only the performance of the two file systems is compared:

  • SELECT + INSERT operations

Select data of different magnitudes from a table of about 30 million rows and insert it into another table with the same schema, comparing the time taken on HDFS and on JuiceFS.

  • SELECT + COUNT operation

Select data of different magnitudes from the same table of about 30 million rows and run COUNT over it, comparing the time taken on HDFS and on JuiceFS.

  • SELECT + ORDER BY

Sort the data in the table of about 30 million rows and compare the time taken on HDFS and on JuiceFS. The query shapes used in all three tests are sketched after this list.
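For reference, the three query shapes look roughly like the sketch below when driven through Spark SQL. The table and column names are hypothetical, and the WHERE clause is what varied the data size between runs.

```python
import time
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hdfs-vs-juicefs")
         .enableHiveSupport().getOrCreate())

def timed(label, fn):
    t0 = time.time()
    fn()
    print(f"{label}: {time.time() - t0:.1f}s")

# 1. SELECT + INSERT: copy a slice into an identically structured table.
timed("insert", lambda: spark.sql("""
    INSERT INTO orders_copy
    SELECT * FROM orders WHERE dt >= '2019-06-01'  -- WHERE clause varies the size
"""))

# 2. SELECT + COUNT over the same slice.
timed("count", lambda: spark.sql(
    "SELECT COUNT(*) FROM orders WHERE dt >= '2019-06-01'").collect())

# 3. SELECT + ORDER BY: full sort of the ~30M-row table.
#    foreach() forces execution on the executors without collecting to the driver.
timed("orderby", lambda: spark.sql(
    "SELECT * FROM orders ORDER BY created_at").foreach(lambda r: None))
```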

To sum up, JuiceFS's time for querying and inserting data is relatively stable, and overall it takes less time than HDFS. For plain SELECTs, performance is similar in most cases and better than HDFS in some, while the single-table sort performs about the same on both.

Cost Control

We compared the cost of JuiceFS against HDFS (with the HDFS cluster keeping 20% storage headroom). With the same amount of data (JuiceFS applies a further round of deep compression, at a ratio of roughly 3:1) and equivalent computing resources, JuiceFS saves at least 18% per month compared with deploying HDFS on cloud hosts.
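The arithmetic behind the storage side of that comparison can be sketched as follows. The unit prices are invented placeholders rather than Alibaba Cloud's actual pricing, so the printed saving will not match the 18% figure above, which also accounts for equivalent compute.

```python
# Illustrative structure of the storage comparison only; prices are made up.
logical_tib = 100.0            # hypothetical logical data volume

# HDFS on cloud hosts: 3x replication plus the 20% free-space headroom kept above.
hdfs_raw_tib = logical_tib * 3 * 1.20

# JuiceFS: a single copy in object storage, after ~3:1 deep compression.
juicefs_raw_tib = logical_tib / 3

price_cloud_disk = 40.0        # $/TiB-month on cloud-host disks (placeholder)
price_object_store = 18.0      # $/TiB-month on object storage (placeholder)

print(f"HDFS raw:    {hdfs_raw_tib:.0f} TiB -> ${hdfs_raw_tib * price_cloud_disk:,.0f}/mo")
print(f"JuiceFS raw: {juicefs_raw_tib:.0f} TiB -> ${juicefs_raw_tib * price_object_store:,.0f}/mo")
```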

Overall, both the performance and the cost of JuiceFS fully satisfy the company's requirements.

Future Outlook

Separation of Storage and Computing

With the introduction of JuiceFS into the big data clusters, storage and computing have effectively been separated, and flexible, elastic scaling of computing resources in the big data clusters has become possible; for example, the computing resources of business machines can be lent to the big data clusters during the early-morning off-peak hours.

The following is the current architecture of the whole big data cluster:

In the future, by combining storage-computing separation with dynamic scaling, the following target architecture can be designed:

Combined with Kubernetes, resources can be requested on demand, cutting both expenses and maintenance effort.

Recommended Reading: Best Practices for JuiceFS CSI Driver

Project address: GitHub (https://github.com/juicedata/juicefs). If it helps you, please follow us! (0ᴗ0✿)
