JuiceFS at Li Auto: Use and Outlook

Li Auto is a Chinese new energy vehicle manufacturer that designs, develops, manufactures, and sells luxury smart electric vehicles. Founded in July 2015 and headquartered in Beijing, it operates its own manufacturing base in Changzhou, Jiangsu, which is already in production. The company aims to provide safe and convenient products and services for family users.

Li Auto is a pioneer in the successful commercialization of extended-range electric vehicles in China. Its first and currently only commercialized extended-range model, the Li ONE, is a six-seat, mid-to-large luxury electric SUV (sport utility vehicle) equipped with a range extension system and advanced smart vehicle solutions. The Li ONE entered mass production in November 2019, and the 2021 Li ONE was launched on May 25, 2021. As of December 31, 2021, Li Auto had delivered 124,088 Li ONEs.

Background

According to relevant national regulations and standards, signal data from the core components of new energy vehicles, such as engines and batteries, must be collected while the vehicle is driving and reported to the new energy vehicle data platform built by the government. Regulators also require car manufacturers to store this data to support subsequent after-sales maintenance, OTA upgrades, vehicle health monitoring, early warning, and servicing. To better serve users, Li Auto started to build its own data platform.

Li Auto is committed to creating mobile homes and becoming a world-leading smart electric vehicle company, so the scale of data it has to manage is very large. This article covers only the time series signal data from Li Auto vehicles. In the architecture of the automotive data platform, the full set of time series signal data is stored in HDFS, and the Hadoop technology stack is used to run the various complex computing and analysis tasks that the business requires.

In December 2021, Li Auto delivered 14,087 Li ONEs, up 130.0% from December 2020. From January to December 2021, it delivered a total of 90,491 Li ONEs, a year-on-year increase of 177.4%. Cumulative deliveries of the Li ONE since launch have reached 124,088 units. As one would expect, the vehicle data managed by the data platform is growing just as fast, which places very high demands on the platform's agility and elasticity.

Big data veterans know that expanding HDFS is time-consuming and labor-intensive, and that it sometimes cannot keep up with business growth. Caught between a fast-growing business and an inflexible HDFS, the engineers who maintain the data platform sometimes had to delete invalid and redundant data and rebalance data across DataNodes to ease the tension between the business's need for agility and the rigidity of HDFS. In addition, because Hadoop couples storage with compute, adding storage space also means adding compute. Storage and compute needs rarely grow in step, so this mismatched expansion leaves a lot of compute capacity idle and creates unnecessary waste.

Rapid business growth also brought welcome problems to the data platform. In 2020, the data platform team set out to resolve the conflict between fast-changing business needs and the inflexibility of HDFS. The selection criteria at the time were:

Minimal changes to existing ETL processes and computation logic, in other words excellent HDFS compatibility
Excellent elasticity
Transparent acceleration without performance bottlenecks
Stability at least on par with HDFS

At first, we tested the Hadoop SDK integrations provided by cloud vendors. However, because they implemented only a limited subset of the Hadoop APIs and had no cache, their stability and performance were far inferior to HDFS, so solving the problem was put on hold.

The beginning of 2021 coincided with JuiceFS being open-sourced, and colleagues on the data platform team learned about the JuiceFS cloud service. JuiceFS is fully compatible with the HDFS API and offers both elasticity and caching, so our preliminary judgment was that it could meet the first three selection criteria. We decided to give it a try. We would also like to thank the JuiceFS team for their great help, which allowed the JuiceFS community edition to go live smoothly at Li Auto. It solved the HDFS capacity shortage and upgraded our Hadoop architecture to separate storage from compute. Most importantly, it meets the business's agility requirements.

Introducing JuiceFS

JuiceFS is a high-performance open source distributed file system designed for the cloud environment. It is fully compatible with POSIX, HDFS, and S3 interfaces. It is suitable for scenarios such as big data, AI model training, Kubernetes shared storage, DevOps, and massive data archiving. Using JuiceFS to store data, the data itself will be persisted in object storage (such as Amazon S3), and the metadata corresponding to the file system can be stored in Redis, MySQL, TiKV and other database engines. At the same time, the JuiceFS client has caching capability to provide intelligent I/O acceleration for upper-layer applications.

Application scenarios

After half a year of use and iteration, JuiceFS now serves multiple business scenarios at Li Auto. Some typical ones are described below. We hope they are useful to JuiceFS community users, and your ideas and questions are welcome.

JuiceFS supports core data warehouse storage

Scenario

Currently, vehicle data analysis adds 2 TB of new data every day. Spark reads and writes the data directly on JuiceFS for ETL processing. Because JuiceFS is fully compatible with the HDFS API, we only need to point the table path at a JuiceFS directory, and the switch is transparent to users.
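As a rough illustration of what pointing a table at JuiceFS can look like, the sketch below builds the Spark settings for the JuiceFS Hadoop SDK and a Hive ALTER TABLE ... SET LOCATION statement. The volume name myjfs, the table name, the paths, and the Redis address are placeholders invented for this example, not Li Auto's actual values, and the article does not say which exact statements were used.

```python
def juicefs_spark_conf(meta_url: str) -> dict:
    """Spark settings that register the jfs:// scheme with the JuiceFS Hadoop SDK.

    meta_url is the metadata engine address (placeholder value in this example).
    """
    return {
        "spark.hadoop.fs.jfs.impl": "io.juicefs.JuiceFileSystem",
        "spark.hadoop.fs.AbstractFileSystem.jfs.impl": "io.juicefs.JuiceFS",
        "spark.hadoop.juicefs.meta": meta_url,
    }


def relocate_table_sql(table: str, volume: str, path: str) -> str:
    """Build a Hive statement that points a table at a JuiceFS directory."""
    location = f"jfs://{volume}/{path.lstrip('/')}"
    return f"ALTER TABLE {table} SET LOCATION '{location}'"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for key, value in juicefs_spark_conf("redis://meta-host:6379/1").items():
        print(f"--conf {key}={value}")
    print(relocate_table_sql("dw.vehicle_signal", "myjfs", "/warehouse/vehicle_signal"))
```

For a partitioned Hive table, changing the table location does not move existing partitions, so existing data still has to be copied and its partitions relocated separately.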

Benefits

After switching to JuiceFS, storage moved from the limited disks of HDFS to object storage with effectively unlimited capacity. The switch also separated storage from compute in the Hadoop cluster: storage now scales elastically through JuiceFS, and the compute cluster can be scaled out or in independently according to business volume. This lets the data platform respond to business growth and changing requirements with more agility.

Improvement plan

When the business went live in the first half of the year, JuiceFS used a Redis instance hosted on the public cloud to store metadata. Because JuiceFS relies on the Redis transaction API, Redis cluster mode cannot be used, so the capacity ceiling of a single Redis instance limits the number of files in a single JuiceFS file system. For that reason, not all tables have been migrated to JuiceFS yet. Now that JuiceFS supports TiKV for metadata storage, we are preparing to test it and migrate all data to JuiceFS, using idle local physical disks as cache disks.
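A back-of-the-envelope estimate shows why a single Redis instance caps the file count. The sketch below assumes roughly 300 bytes of Redis memory per file, a commonly cited rule of thumb for JuiceFS on Redis. That figure and the instance sizes are assumptions for illustration, not measurements from Li Auto.

```python
BYTES_PER_FILE = 300  # assumed average metadata footprint per file in Redis


def estimate_max_files(memory_gib: int, bytes_per_file: int = BYTES_PER_FILE) -> int:
    """Rough upper bound on the files whose metadata fits in one Redis instance."""
    if memory_gib <= 0 or bytes_per_file <= 0:
        raise ValueError("memory_gib and bytes_per_file must be positive")
    return (memory_gib * 1024**3) // bytes_per_file


if __name__ == "__main__":
    # Hypothetical instance sizes.
    for gib in (16, 64, 256):
        millions = estimate_max_files(gib) / 1e6
        print(f"{gib} GiB of Redis memory -> about {millions:.0f} million files")
```

Because the bound grows only with the memory of one instance, a metadata engine that scales horizontally, such as TiKV, removes this ceiling.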

JuiceFS supports tiered storage for the MatrixDB time series database

Scenario

In Li Auto's MatrixDB cluster, nearly 500 GB of incremental data is still added every day, even after compression. Time series data like this is highly time-sensitive: the older it is, the less often it is queried. The dilemma is that even historical data has low-frequency query needs, so it cannot be deleted, while MatrixDB is designed around local storage and cannot flexibly scale out or in. After seeing how JuiceFS was used for data tiering on ClickHouse, we recommended it to the MatrixDB team. MatrixDB soon supported an automatic tiered storage mechanism, which moves warm and cold data from local disks to JuiceFS automatically while still serving queries.

Benefits

Storage costs dropped by nearly 50% while users barely noticed the change. JuiceFS provides tiered storage for the MatrixDB time series database: hot data is written to local SSDs and then moved to JuiceFS automatically by a lifecycle policy. The whole process needs only simple configuration, is automatic and transparent, removes the need for frequent manual expansion, and greatly reduces storage costs. The freed SSD capacity can serve as a cache to accelerate warm and cold data, so occasional queries against that data still perform well.
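The lifecycle idea can be sketched as an age-based rule: partitions newer than a retention window stay on local SSD, and older ones are candidates to move to JuiceFS. This is not MatrixDB's actual policy syntax, which the article does not show. The seven-day window, the tier names, and the function names are assumptions made up for this example.

```python
from datetime import date, timedelta

HOT_RETENTION = timedelta(days=7)  # assumed hot window, not a value from the article


def choose_tier(partition_day: date, today: date,
                hot_retention: timedelta = HOT_RETENTION) -> str:
    """Return the tier a daily partition belongs to, based only on its age."""
    age = today - partition_day
    return "local_ssd" if age <= hot_retention else "juicefs"


def partitions_to_move(partition_days, today: date,
                       hot_retention: timedelta = HOT_RETENTION):
    """List the partitions, oldest first, that have aged out of the hot tier."""
    return sorted(
        day for day in partition_days
        if choose_tier(day, today, hot_retention) == "juicefs"
    )


if __name__ == "__main__":
    today = date(2022, 1, 31)
    days = [today - timedelta(days=n) for n in (0, 3, 7, 8, 30)]
    for day in partitions_to_move(days, today):
        print(f"move partition {day.isoformat()} to JuiceFS")
```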

Cross-platform data exchange

Scenario

The data platform is built on the Hadoop technology stack, while the algorithm platform uses Kubernetes for resource management. The two platforms have upstream and downstream relationships in many business processes: the data platform prepares the data and hands it to the algorithm platform for model training. We handle this exchange by having the data platform write data directly to Hive tables that are stored on JuiceFS. When the algorithm platform starts a Pod, it automatically mounts the same JuiceFS file system in POSIX mode, so the application in the Pod can read feature data as if it were in a local directory. Training results are also written back to JuiceFS in POSIX mode, so data platform colleagues can easily use the results produced by the algorithm team.
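From inside a Pod, the exchange described above needs nothing beyond ordinary file operations. The following is a minimal sketch that assumes a Hive-style partition directory layout (dt=YYYY-MM-DD) and a mount point such as /mnt/jfs. Both are assumptions for illustration, as are the function names; the article does not give the real paths.

```python
from pathlib import Path


def partition_files(table_dir: Path, dt: str) -> list:
    """List the data files of one Hive-style partition on the mounted file system.

    Marker files such as _SUCCESS and hidden temporary files are skipped.
    """
    partition = table_dir / f"dt={dt}"
    return sorted(
        p for p in partition.iterdir()
        if p.is_file() and not p.name.startswith(("_", "."))
    )


def write_result(result_dir: Path, name: str, payload: bytes) -> Path:
    """Write a training result via a temporary file and a rename.

    Readers on the data platform side then never see a half-written file.
    """
    result_dir.mkdir(parents=True, exist_ok=True)
    tmp = result_dir / f".{name}.tmp"
    tmp.write_bytes(payload)
    final = result_dir / name
    tmp.replace(final)
    return final


if __name__ == "__main__":
    mount = Path("/mnt/jfs")  # hypothetical mount point inside the Pod
    table = mount / "warehouse" / "features"
    if table.exists():
        for f in partition_files(table, "2021-12-01"):
            print(f)
```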

Benefits

Data workflows are getting longer and more complex, and tasks have to be completed collaboratively across different platforms and teams. In the past, data was constantly moved between storage systems, and the time spent copying it and verifying its correctness, along with the waiting and repeated work, seriously hurt efficiency. Now JuiceFS serves as a unified data lake, so different platforms and applications can share all kinds of data without waiting, which has improved efficiency considerably.

Improvement plan

Currently, the data platform runs ETL jobs in a multi-tenant fashion, while Pods launched by the algorithm platform run as the root user by default. After algorithm colleagues write result data back to JuiceFS, only the root user has write permission on it, so the Hive component of the data platform fails to add a new partition because it lacks write permission. The community recently proposed a solution: add the Hive component's user to the Hadoop supergroup so that it has the same write permission as the root user. We will test this together with the algorithm platform once the new version is released.
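The failure and the proposed fix can be reproduced with a simplified model of HDFS-style permission checks. This is a sketch for reasoning about the problem, not JuiceFS's implementation. The class, function, and default names are hypothetical, and ACLs and sticky bits are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inode:
    owner: str
    group: str
    mode: int  # POSIX permission bits, for example 0o755


def can_write(user: str, groups: set, inode: Inode,
              superuser: str = "root", supergroup: str = "supergroup") -> bool:
    """Simplified write check: superuser and supergroup members bypass mode bits."""
    if user == superuser or supergroup in groups:
        return True
    if user == inode.owner:
        return bool(inode.mode & 0o200)
    if inode.group in groups:
        return bool(inode.mode & 0o020)
    return bool(inode.mode & 0o002)


if __name__ == "__main__":
    # A result directory written by a Pod running as root with default permissions.
    result_dir = Inode(owner="root", group="root", mode=0o755)
    print("hive before:", can_write("hive", {"hive"}, result_dir))
    print("hive in supergroup:", can_write("hive", {"hive", "supergroup"}, result_dir))
```

Granting supergroup membership is broad, since it bypasses every check, so it suits a trusted service account such as the Hive component rather than individual users.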

Platform shared files

Scenario

In the past, the entire data platform used HDFS to share files, and the platform's front-end applications uploaded data directly to HDFS through a back-end service interface. This posed security risks, and downloads of large files from HDFS sometimes failed during the dense task execution window in the early morning, which affected task stability. The real-time development platform has now switched its file sharing to JuiceFS in POSIX mode. In the future, we plan to move every platform that needs to share files onto JuiceFS for unified management.

Benefits

POSIX access makes application development simpler and more efficient. JuiceFS also delivers more stable throughput than HDFS during peak periods.

Outlook

After nearly a year of use, we have kept following the JuiceFS community's iterations and have come to understand JuiceFS better. When we run into problems, we get timely feedback from the community and the problems are resolved quickly. Thanks to the strong support of our community partners, we have kept upgrading along with the community releases (upgrading JuiceFS is very simple). In next year's work plan, on the one hand, we will continue to broaden and deepen our usage scenarios, and we plan to validate and roll out JuiceFS in the company's many autonomous driving image retrieval scenarios. On the other hand, we have started developing on JuiceFS ourselves; once the work is verified, we will discuss it with the community and contribute it back upstream.

First, the goal for 2022 is to scale HDFS back until it is removed entirely, after which object storage will serve as the underlying storage for the whole data lake. We also hope to enable data sharing between the data lake and the data warehouse. Object storage scales well, but its network overhead costs some efficiency, and JuiceFS provides a local cache to improve performance. Li Auto's storage team is currently preparing to develop features that raise the cache hit rate, such as local P2P reads.

Second, our entire platform runs in a multi-tenant environment, while JuiceFS is currently designed around a single file system and has no multi-tenancy capabilities. We are preparing to develop a management tool similar to Apache Ranger that provides centralized security policy management and user access monitoring, to manage data security in JuiceFS.

Third, mounting JuiceFS in POSIX mode currently requires passing the metadata engine information directly. Hiding this from users would improve the mount experience on the one hand and, on the other, isolate the cluster's deployment details, making maintenance easier for the platform team.

Fourth, the data lake plans to use TiKV for metadata storage, but in some scenarios TiKV is not as fast as Redis. We are therefore considering keeping Redis for scenarios that need high metadata performance and have a bounded amount of data. This means maintaining multiple JuiceFS file systems. We would like each JuiceFS file system to appear as a directory, so that users see what looks like a single file system, similar to Hadoop ViewFS with multiple NameNodes.
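The ViewFS-like idea boils down to a mount table that maps directory prefixes in one logical namespace to separate JuiceFS file systems. The sketch below resolves a logical path by longest-prefix match. The volume names lake-tikv and hot-redis and the /realtime prefix are hypothetical, and this only illustrates the concept, not a planned JuiceFS feature.

```python
from pathlib import PurePosixPath

# Hypothetical mount table: logical directory -> backing JuiceFS file system.
MOUNT_TABLE = {
    "/": "jfs://lake-tikv",           # bulk data lake, TiKV metadata
    "/realtime": "jfs://hot-redis",   # metadata-heavy workloads, Redis metadata
}


def resolve(path: str, mounts: dict = MOUNT_TABLE) -> str:
    """Map a logical path to a physical URI using the longest matching mount point."""
    logical = PurePosixPath(path)
    candidates = [m for m in mounts if logical.is_relative_to(m)]
    if not candidates:
        raise KeyError(f"no mount point covers {path}")
    mount_point = max(candidates, key=len)
    rest = logical.relative_to(mount_point)
    suffix = "" if rest == PurePosixPath(".") else f"/{rest}"
    return f"{mounts[mount_point]}{suffix}"


if __name__ == "__main__":
    for p in ("/warehouse/vehicle_signal", "/realtime/signals/a.parquet"):
        print(p, "->", resolve(p))
```

Matching on whole path components means a directory such as /realtime_old still resolves through the root mount instead of being mistaken for /realtime.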

You are welcome to check out our project, Juicedata/JuiceFS! (0ᴗ0✿)