How does Xiaohongshu handle trillion-scale social network relationships? Meet REDtao, its graph storage system!

Xiaohongshu is a product centered on community. It hosts life communities across many fields and stores massive volumes of social network relationships. To handle the updates and associated reads of ultra-large-scale data in social scenarios, and to reduce database pressure and cost, we built REDtao, a self-developed graph storage system for ultra-large-scale social networks, which greatly improved system stability. The system draws on the design of Facebook's graph storage system, wraps the cache and the underlying database, exposes a unified graph query API, and achieves access convergence and efficient edge aggregation in the cache.

Xiaohongshu is a life recording and sharing platform aimed mainly at young people. Users record their daily lives and share their lifestyles through short videos, pictures, and text. In Xiaohongshu's social domain, we have entities such as users, notes, and products, with various relationships between them. For example, a user and a note can be connected by three relationships: "own" (publish), "like", and "collect", each with a corresponding inverse relationship such as "liked-by" and "collected-by".

Xiaohongshu's social graph has reached a scale of one trillion edges and is growing rapidly. When users log in to Xiaohongshu, each sees friends, fans, likes, collections, and other content tailored for them.

This information is highly personalized and requires reading user-related information from this massive social relationship data in real time. The workload is read-heavy, and the read pressure is very high.

In the past, we stored this social graph data in MySQL, which was mature from an operations standpoint. However, at only about one million requests per second, MySQL's CPU usage already reached 55%. With the explosive growth of users and DAU, the MySQL deployment had to keep expanding, bringing huge cost and stability pressure. To solve these problems, and given that no suitable open source solution existed, at the beginning of 2021 we started building REDtao from scratch.

We thoroughly surveyed implementations at other companies in the industry and found that companies with a strong social dimension generally have a self-developed graph storage system:

Facebook implemented a specialized distributed social graph database called TAO and uses it as a core storage system; Pinterest, like Facebook, implemented a similar graph storage system; ByteDance built ByteGraph in-house and uses it to store its core social graph data; LinkedIn built its social graph service on top of a KV store.

At the time, our social graph data was already stored in MySQL at a huge scale, and the social graph data service was critical, with very high stability requirements. Facebook had faced problems similar to ours, with data stored in Memcache and MySQL. Therefore, taking Facebook's TAO graph storage system as a reference was a better fit for our situation and existing technical architecture, and carried less risk.

Access to the social graph consists mainly of edge-relationship queries. Our graph model represents a relationship as a <key, value> pair, where the key is the triple (FromId, AssocType, ToId) and the value is the attributes in JSON format. For example, "User A" follows "User B"; the corresponding storage structure in REDtao is:

 <FromId: user A's ID, AssocType: follow, ToId: user B's ID>  ->  Value (attributes as a JSON field)
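As a minimal sketch of this model (the class and function names here are illustrative, not REDtao's actual types), the edge key can be represented as a triple mapping to a JSON value:

```python
import json
from typing import NamedTuple, Optional

# Hypothetical sketch of REDtao's edge model: the key is the
# (FromId, AssocType, ToId) triple, the value is a JSON attribute blob.
class AssocKey(NamedTuple):
    from_id: int
    assoc_type: str
    to_id: int

store: dict = {}

def put_assoc(from_id: int, assoc_type: str, to_id: int, attrs: dict) -> None:
    store[AssocKey(from_id, assoc_type, to_id)] = json.dumps(attrs)

def get_assoc(from_id: int, assoc_type: str, to_id: int) -> Optional[dict]:
    raw = store.get(AssocKey(from_id, assoc_type, to_id))
    return json.loads(raw) if raw is not None else None

# "User A" (id 1) follows "User B" (id 2)
put_assoc(1, "follow", 2, {"created_at": 1700000000})
```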

After analyzing business requirements, we encapsulated 25 APIs with graph semantics for business teams, covering their create, delete, update, and query needs and converging how the system is accessed. Compared with Facebook's TAO, we also added the graph semantics required by our social graph and provide extra filter parameters for anti-cheating scenarios. At the cache level, we support configuring local secondary indexes on different fields. Here are some typical usage scenarios:

Scenario 1:

Get all normal users who follow user A (excluding cheating users):

getAssocs("followed-by", user A's ID, page offset, max results, normal users only, sort by time from newest to oldest)

Scenario 2:

Get user A's fan count (excluding cheating users):

getAssocCount("followed-by", user A's ID, normal users only)
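The two calls above can be sketched over an in-memory edge list. This is an assumed approximation of the API semantics; the field names, the `is_normal_user` stand-in for the anti-cheating filter, and the function signatures are illustrative, not REDtao's real interface:

```python
from dataclasses import dataclass

@dataclass
class Assoc:
    from_id: int
    to_id: int
    created_at: int
    is_normal_user: bool = True   # stand-in for the anti-cheating filter

edges: dict = {}

def add_assoc(assoc_type: str, a: Assoc) -> None:
    edges.setdefault((a.from_id, assoc_type), []).append(a)

def get_assocs(assoc_type, from_id, offset, limit,
               normal_only=True, newest_first=True):
    out = [e for e in edges.get((from_id, assoc_type), [])
           if e.is_normal_user or not normal_only]
    out.sort(key=lambda e: e.created_at, reverse=newest_first)
    return out[offset:offset + limit]

def get_assoc_count(assoc_type, from_id, normal_only=True):
    return len(get_assocs(assoc_type, from_id, 0, 10**9, normal_only))

add_assoc("followed_by", Assoc(1, 10, created_at=100))
add_assoc("followed_by", Assoc(1, 11, created_at=200, is_normal_user=False))
add_assoc("followed_by", Assoc(1, 12, created_at=300))
```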

The architectural design of REDtao takes into account the following key elements:

3.1 Overall architecture

The overall architecture is divided into three layers: the access layer, the cache layer, and the persistence layer. Business teams access the service through the REDtao SDK. As shown below:

In this architecture, unlike Facebook's TAO, our cache layer is an independent distributed cluster decoupled from the persistence layer below; the two layers can scale out and in independently. Cache shards and MySQL shards need not correspond one-to-one, which gives better flexibility, and the MySQL cluster becomes pluggable, replaceable persistent storage.

Read path: the client sends the read request to the router. After receiving the RPC request, the router selects the appropriate REDtao cluster according to the edge type, uses consistent hashing on the triple (FromId, AssocType, ToId) to find the Follower node holding the shard, and forwards the request to that node. The Follower first queries its local graph cache and returns the result directly on a hit. On a miss, it forwards the request to the Leader node; likewise, the Leader returns on a hit and queries the underlying MySQL database on a miss.
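The routing step can be illustrated with a minimal consistent-hash ring. The ring layout, virtual-node count, and node names below are assumptions for illustration, not REDtao's actual implementation:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring mapping edge triples to cache nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` positions on the ring for balance.
        self.ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, from_id, assoc_type, to_id):
        # Hash the triple and walk clockwise to the next ring position.
        h = self._hash(f"{from_id}:{assoc_type}:{to_id}")
        idx = bisect.bisect(self.keys, h) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["follower-1", "follower-2", "follower-3"])
node = ring.node_for(1, "follow", 2)
```

The same triple always lands on the same node, so cache lookups for one edge are never scattered across the cluster.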

Write path: the client sends the write request to the router and, as in the read path, it is forwarded to the corresponding Follower node. The Follower forwards the write to the Leader, and the Leader forwards it to MySQL. Once MySQL confirms the write, the Leader clears the corresponding key from its local graph cache and sends invalidations for that key to all other Followers, ensuring eventual consistency of the data.
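The write path's write-through-then-invalidate order can be sketched as follows. The class names are illustrative; the point is the ordering: persist to MySQL first, then drop stale cache copies:

```python
class Follower:
    def __init__(self):
        self.cache = {}

    def invalidate(self, key):
        self.cache.pop(key, None)

class Leader:
    def __init__(self, followers, db):
        self.cache, self.followers, self.db = {}, followers, db

    def write(self, key, value):
        self.db[key] = value            # 1. persist to MySQL first
        self.cache.pop(key, None)       # 2. drop the Leader's own copy
        for f in self.followers:        # 3. fan out invalidations
            f.invalidate(key)

db = {}
followers = [Follower(), Follower()]
leader = Leader(followers, db)
followers[0].cache[("u1", "follow", "u2")] = "stale"
leader.write(("u1", "follow", "u2"), '{"ts": 1}')
```

After the write, the next read of that key misses the cache and refills from MySQL, which is why the scheme converges to eventual consistency.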

3.2 High availability

REDtao is divided into two independent layers: cache layer and persistence layer. Each layer guarantees high availability.

Self-developed distributed cache:

We have self-developed a distributed cache cluster that implements graph semantics and supports automatic fault detection and recovery, as well as horizontal expansion and contraction.

The cache has two tiers: each shard has one Leader and several Followers. All requests go first to a Follower in the outer tier, which forwards them to the Leader. The advantage is that when read pressure is high, we only need to scale the Followers horizontally; funneling writes through a single Leader also reduces complexity and makes data consistency easier to achieve.

If a replica fails, the system fails over within seconds. When the persistence layer fails, the distributed cache layer can still serve reads.

Highly available MySQL cluster:

The MySQL cluster implements database and table sharding through self-developed middleware and supports horizontal scaling of MySQL. Each MySQL primary has several replicas and follows the same operations practices as other MySQL deployments in the company to ensure high availability.

Rate-limiting protection:

To prevent a cache breakdown from flooding MySQL with requests and bringing it down, we protect MySQL by capping the maximum number of concurrent MySQL requests per primary node. Once the cap is reached, further requests wait until an in-flight request completes and returns, or are rejected after a waiting timeout rather than being passed on to MySQL. The rate-limiting threshold can be adjusted online according to the size of the MySQL cluster.

To prevent crawlers or cheating users from repeatedly hammering the same piece of data, we use REDtaoQueue to execute reads and writes of the same edge sequentially, with a bounded queue length to stop large numbers of identical requests from executing at the same time. Compared with a single global queue controlling all requests, a per-edge queue limits repeated identical requests without affecting other normal requests.
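The per-edge queue idea can be sketched as below: requests for the same edge are serialized through a bounded queue keyed by that edge, while other edges proceed independently. The bound and the drop-on-full behavior are illustrative assumptions, not REDtaoQueue's exact semantics:

```python
from collections import deque

class EdgeQueues:
    def __init__(self, max_len: int):
        self.max_len = max_len
        self.queues = {}

    def submit(self, edge_key, request) -> bool:
        q = self.queues.setdefault(edge_key, deque())
        if len(q) >= self.max_len:
            return False          # queue full: shed the duplicate burst
        q.append(request)
        return True

    def drain(self, edge_key):
        # Execute queued requests for one edge strictly in order.
        q = self.queues.get(edge_key, deque())
        results = [req() for req in q]
        q.clear()
        return results

qs = EdgeQueues(max_len=2)
edge = (1, "follow", 2)
qs.submit(edge, lambda: "read-1")
qs.submit(edge, lambda: "read-2")
overflow_accepted = qs.submit(edge, lambda: "read-3")
```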

3.3 Ultimate performance

The data structure design is an important foundation of REDtao's high performance. We adopted a three-level nested hash-table design. The first-level hash table in REDtaoGraph is keyed by a starting point from_id and records all of its outgoing edge information across all types. The second-level hash table, keyed by type_id, holds the count, index, and other metadata of all outgoing edges of that AssocType. The last-level hash table, keyed by to_id, holds the final edge information: creation time, update time, version, data, and a REDtaoQueue; a time_index sorts the list by creation time. The last-level hash table and its index keep only the latest 1,000 edges, both to stop super nodes from consuming too much memory and to focus on the hit rate and efficiency of recent hot data. REDtaoQueue serializes the reads and writes of one relationship and records only the metadata of the most recent request. On every query or write, REDtaoAssoc is looked up first; if it is not in the cache, an object containing only a REDtaoQueue is created; if it is, the queue metadata is updated, and the request sets itself as the last entry of the queue, pending execution.

Through this multi-level hash + skip-list design, we can efficiently organize the relationships between points, edges, indexes, and time-ordered linked lists. Memory allocation and release are completed on the same thread.
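The three-level layout above can be sketched with nested dictionaries: from_id → assoc_type → to_id → edge record, with a per-(from_id, assoc_type) time index capped at the newest 1,000 edges. The record fields and the cap come from the text; the structure names are illustrative:

```python
MAX_INDEXED = 1000

graph = {}   # from_id -> assoc_type -> {"edges": ..., "time_index": ...}

def upsert_edge(from_id, assoc_type, to_id, data, created_at):
    by_type = graph.setdefault(from_id, {})
    assoc = by_type.setdefault(assoc_type, {"edges": {}, "time_index": []})
    assoc["edges"][to_id] = {"data": data, "created_at": created_at}
    assoc["time_index"].append((created_at, to_id))
    assoc["time_index"].sort(reverse=True)          # newest first
    # Evict the oldest entries beyond the cap to bound super-node memory.
    for _, old_to in assoc["time_index"][MAX_INDEXED:]:
        assoc["edges"].pop(old_to, None)
    del assoc["time_index"][MAX_INDEXED:]

def lookup(from_id, assoc_type, to_id):
    return (graph.get(from_id, {})
                 .get(assoc_type, {})
                 .get("edges", {})
                 .get(to_id))

upsert_edge(1, "follow", 2, {"src": "app"}, created_at=100)
```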

In the online environment, the system reaches 1.5 million queries per second on a 16-core cloud virtual machine, with CPU utilization of only 22.5%. Below is a monitoring chart of an online cluster: single-machine QPS reaches 30K, and each RPC request aggregates 50 queries.

3.4 Ease of use

Rich graph semantic API:

We encapsulated 25 graph-semantic APIs in REDtao for business teams, covering their create, delete, update, and query needs. Business teams no longer write SQL statements themselves, making usage simpler and more uniform.

Unified access URL:

Because the community backend data is so large, we split it into several REDtao clusters by service and priority. To keep business teams unaware of this backend split, we implemented a unified access layer: different teams use the same service URL and send requests to the access layer through the SDK. The access layer receives graph-semantic requests from different teams and routes them to different REDtao clusters according to edge type. By subscribing to the configuration center, it learns edge routing relationships in real time, realizing a unified access URL that is convenient for business teams.

3.5 Data consistency

For social graph data, consistency is crucial. We must strictly guarantee eventual consistency, and strong consistency in certain scenarios. To this end, we took the following measures:

Resolution of cache update conflicts:

REDtao generates a globally monotonically increasing, unique version number for each write request. When filling the local cache with MySQL data, the version number must be compared; if it is lower than the version already cached, the update is rejected to avoid conflicts.
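A minimal sketch of this version check (names are illustrative): a cache fill from MySQL is rejected if its version is older than what the cache already holds.

```python
cache = {}   # key -> (version, value)

def update_cache(key, version: int, value: str) -> bool:
    current = cache.get(key)
    if current is not None and version < current[0]:
        return False          # stale fill from MySQL: reject it
    cache[key] = (version, value)
    return True

update_cache(("u1", "follow", "u2"), 5, "v5")
stale_applied = update_cache(("u1", "follow", "u2"), 3, "v3")
```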

Read-after-write consistency:

The Proxy routes requests for points or edges with the same fromId to the same read-cache node, ensuring consistency of the data read.
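A sketch of that routing rule: all requests for the same fromId hash deterministically to the same cache node, so a reader observes the invalidations caused by its own writes. The node names and hash choice are illustrative:

```python
import hashlib

NODES = ["cache-1", "cache-2", "cache-3"]

def node_for_from_id(from_id: int) -> str:
    # Deterministic hash: the same fromId always maps to the same node.
    digest = hashlib.md5(str(from_id).encode()).digest()
    return NODES[int.from_bytes(digest[:4], "big") % len(NODES)]

target = node_for_from_id(42)
```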

Master node abnormality scenario:

After the Leader node receives an update, it turns the update into a cache-invalidation request and sends it asynchronously to the other Followers, ensuring the Followers' data is eventually consistent. In abnormal cases, if the Leader's send queue is full and invalidation requests are lost, all other Followers' caches are cleared entirely. If the Leader fails, the newly elected Leader also tells the other Followers to clear their caches. In addition, the Leader rate-limits requests to MySQL, so that even if the caches of individual shards are cleared, MySQL will not be overwhelmed.

A small number of strongly consistent requests:

Since MySQL replicas also serve reads, for the small number of reads that require strong consistency, the client can tag the request with a special flag. REDtao passes the flag through, and the database proxy layer forwards the flagged read to the MySQL primary, guaranteeing strong consistency.

3.6 Cross-cloud multi-active deployment

Cross-cloud multi-active deployment is an important company strategy and an important feature supported by REDtao. REDtao's overall cross-cloud multi-active architecture is as follows:

This differs from Facebook TAO's cross-cloud multi-active implementation, which looks like this:

In Facebook's solution, the underlying MySQL master-slave replication is done through a DTS Replication service, whereas MySQL's native master-slave replication is MySQL's own function and our DTS service does not handle MySQL master-slave replication; adopting that solution would require certain modifications to both MySQL and DTS. Also, as mentioned earlier, our cache and persistence layers are decoupled, so the architectures differ.

Therefore, REDtao's cross-cloud multi-active architecture is designed around our own scenarios, and realizes cross-cloud multi-active capability without changing existing MySQL functionality:

1)  At the persistence layer, we use MySQL's native master-slave binlog replication to synchronize data to replicas in other clouds. Writes issued in other clouds, and the small number of strongly consistent reads, are forwarded to the primary database, while normal reads go to the MySQL databases in the local region to meet read-latency requirements.

2)  Cache-layer consistency is achieved through the MySQL DTS subscription service, which converts binlog events into cache-invalidation requests that clean up stale data in the local region's REDtao cache layer. Since reads may hit any MySQL database in the region, the DTS subscription uses a delayed-subscription feature to read the log from the node with the slowest binlog synchronization, avoiding races between DTS invalidation requests and cache-miss refills in the region that would otherwise cause data inconsistency.
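The DTS-driven invalidation in 2) can be sketched as an event handler: each binlog event from the regional MySQL is turned into an invalidation against the local REDtao cache, so the next read refills with fresh data. The event shape and function names below are assumptions:

```python
local_cache = {("u1", "follow", "u2"): "old-value"}

def on_binlog_event(event: dict) -> None:
    # Convert a binlog row event into a cache invalidation for that edge.
    key = (event["from_id"], event["assoc_type"], event["to_id"])
    local_cache.pop(key, None)   # next read misses and refills from MySQL

on_binlog_event({"from_id": "u1", "assoc_type": "follow", "to_id": "u2"})
```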

3.7 Cloud native

REDtao's cloud-native characteristics are mainly reflected in elastic scaling, support for multi-AZ and multi-region data distribution, and portability across cloud vendors. REDtao was designed from the start for elastic scale-out and scale-in and automatic fault detection and recovery.

As Kubernetes matures, we have also been exploring how to use its capabilities to decouple deployment from virtual machines, move further toward cloud native, and simplify deployment and migration across cloud vendors. REDtao implements an Operator that runs on a Kubernetes cluster to achieve faster deployment, scaling, and replacement of failed machines. To let Kubernetes be aware of the cluster's shard assignment and keep Pods of the same shard scheduled on different hosts, the shard assignment is rendered by the Operator, which controls the creation of DuplicateSets (a custom Kubernetes resource developed in-house at Xiaohongshu). REDtao then creates leaders and followers and forms the cluster according to the shard information rendered by the Operator. If a single Pod fails and restarts, it rejoins the cluster without recreating the whole cluster. During upgrades, the Operator, aware of the leader/follower assignment, enforces a followers-first, leader-last order and rolls the upgrade shard by shard to reduce the online impact.

No change is easy. Implementing REDtao itself was the comparatively easy part. Xiaohongshu's social graph data service had run on MySQL for many years, with many different businesses on top of it, and any small problem would affect Xiaohongshu's online users. So migrating existing services to REDtao without any interruption became a major challenge. Our migration had two key points:

● Split the old, large MySQL cluster into four REDtao clusters by priority. This let us migrate the lowest-priority service to a REDtao cluster first, and migrate higher-priority clusters only after sufficient grayscale testing.

● A purpose-built Tao Proxy SDK that supports double-write and double-read, plus data verification and comparison between the original MySQL cluster and the REDtao cluster.

During the migration, we first copied the low-priority data from MySQL to a REDtao cluster through the DTS service, which kept synchronizing incremental data, and upgraded the business teams' SDK. The business-side SDK subscribes to configuration changes in the configuration center; we changed the configuration so the Tao Proxy SDK read and wrote both the MySQL cluster and the REDtao cluster, and then shut down the DTS service. During this phase, the MySQL cluster's results were still the ones returned to users.

When stopping the DTS service, newly arrived MySQL data synchronized through DTS could overwrite data freshly written to the REDtao cluster. Therefore, after shutting down DTS, we used a tool to read the binlog covering the period from the start of double-writing to the DTS shutdown, and verified and repaired the data.

After the repair, the Tao Proxy SDK's double reads reported the number of inconsistencies between the two sides, filtering out mismatches caused only by double-write latency. After a period of grayscale, the diff count was observed to be essentially zero, and we switched the Tao Proxy SDK configuration to read and write only the new REDtao cluster.

Finally, in early 2022, we completed the migration and correctness verification of all of Xiaohongshu's core social graph data at the trillion-edge scale. The entire migration was imperceptible to services, and no failures occurred during the process.

In our social graph workload, more than 90% of requests are reads, and the data shows very strong temporal locality (recently updated data is the most likely to be accessed). After REDtao went online, it achieved a cache hit rate of over 90%, reduced MySQL QPS by more than 70%, and greatly reduced MySQL CPU usage. After reducing the number of MySQL replicas, overall cost fell by 21.3%.

All business access has converged on the API provided by REDtao. During the migration, we also cleaned up some old, unreasonable ways of accessing the MySQL database, as well as the practice of assigning special meanings to certain custom fields, standardizing data access through REDtao.

Comparing early 2022 with early 2023, social graph requests grew by more than 250% along with DAU growth. Under the old MySQL architecture, resources would have had to scale roughly in proportion to request growth, requiring at least a doubling of resources (tens of thousands of cores). Thanks to REDtao and its 90% cache hit rate, overall cost grew by only 14.7% (thousands of cores) while handling 2.5 times the requests, a great improvement in both cost and stability.

In a short period of time, we built the self-developed graph storage system REDtao to solve the problem of rapidly growing social graph relationship data.

REDtao draws on the Facebook TAO paper and makes many improvements in overall architecture and cross-cloud multi-active deployment. It implements a new high-performance distributed graph cache that better fits our business characteristics and provides better elasticity, and it uses Kubernetes capabilities to move further toward cloud native.

As DAU continues to grow, the trillion-scale data keeps growing too, and we face more technical challenges. At present, the company's internal OLTP graph scenarios fall into three parts:

● Social graph data service: the self-developed graph storage system REDtao solves the updates and associated reads of ultra-large-scale data in social scenarios. Trillions of relationships are already stored.

● Risk-control scenarios: the self-developed graph database REDgraph serves multi-hop real-time online queries. It currently stores hundreds of billions of points and edges and can serve queries of two hops or more. (We will introduce REDgraph in the next article.)

● Social recommendation: mainly two-hop queries. All data is imported in batches through Hive every day, with updated data written in near real time through the DTS service. Because this is an online scenario with very high latency requirements that the current REDgraph cannot yet meet, the business teams mainly use REDkv for storage.

For the scenarios above, in order to meet business needs quickly, we used three different self-developed storage systems: REDtao, REDgraph, and REDkv. Clearly, a unified architecture and system would suit these graph scenarios better than three separate storage systems. In the future, we will integrate REDgraph and REDtao into one unified database product, aiming to build industry-leading graph technology and empower more scenarios within the company. Finally, we welcome like-minded engineers with an uncompromising pursuit of technology to join us.

Hollow: Xiaohongshu infrastructure storage team, responsible for the development of the graph storage system REDtao and distributed caching

Liu Bei: head of Xiaohongshu's infrastructure storage team, responsible for the overall architecture and technical evolution of REDkv / REDtao / REDtable / REDgraph

The infrastructure-storage group provides stable and reliable storage and database services to Xiaohongshu’s business departments to meet the business’s functional, performance, cost and stability requirements for storage products.

The team currently develops the distributed KV store, distributed cache, graph storage system, graph database, and table storage. Online storage products include:

● REDkv: distributed high-performance KV

● REDtao: a high-performance graph storage database for one-hop queries

● REDtable: table storage that provides Schema semantics and secondary indexes

● REDgraph: a graph database providing graph-semantic queries of two hops and more

Origin blog.csdn.net/REDtech_1024/article/details/130482830