How a post-00s engineer organizes Double Eleven promotion preparations: this one article covers it | JD Cloud Technology Team

Introduction

Hello everyone, I am Wang Mengen, a member of the post-00s generation that is “rectifying the workplace”. As a campus hire who joined JD.com only last year, I was honored to be put in charge of the 11.11 preparations for the CDP platform. Although I had already experienced big promotions during my internship, preparing an entire department for one was unforgettable. So I picked up my pen to write this summary, recording what I did, felt, and learned during the 11.11 promotion.

How my view of 11.11 has changed

In college, 11.11 to me meant staying up late with my roommates and frantically placing orders on the hour. I was even lucky enough to be among the top 500 shoppers in Nangang District, Harbin.

A year ago, I said goodbye to campus life and officially joined JD.com. On the day of 11.11 that year, everyone was very busy and worked overtime until late. For me, 11.11 meant a free dinner and a lot of delicious food (warning: really delicious ⚠️).

During this year’s 618 promotion, I took part in our department's preparation work. I was responsible for stress testing one of the interfaces, along with some alarm configuration and statistics work.

When I learned that I would be in charge of this promotion, I was full of excitement and motivation. I also realized that I would face huge technical pressure and challenges, but that it would be a very valuable opportunity for technical practice and growth.

What is the CDP platform?

Platform capabilities

The system I prepared this time, Crowd Portrait (CDP), is a "user-centered" system. It provides full-link product capabilities covering data fusion, label production, crowd data services (crowd hits, label values, crowd downloads), crowd analysis and insights, and crowd applications. It empowers businesses to carry out refined marketing and in-depth operations, and to achieve data-driven smart growth across all operations. It is one of JD Technology's core underlying services, a super-P0-level system, and the most important means of achieving precision marketing on the technology side. At present, the portrait platform supports 16+ secondary departments across BGs and BUs, with an average daily call volume of more than 10 billion. It is used on core links such as user acquisition, transaction conversion, and activation for core businesses such as payment, consumer finance, and wealth.

Business scenarios

1. The golden-link service supports 16+ secondary departments across BGs and BUs, with an average daily call volume of more than 10 billion.

2. Provides marketing recommendation services for Eagle Eye, Molo, Jingyin, Lego, and others.

3. Supports decision-making for consumer finance, payment, the financial app, and other businesses.

4. Provides real-time decision-making for marketing and wealth.

5. The cashier, product details, shopping cart, and settlement pages download crowd data so that data services for marketing decisions can run locally.

How to organize the preparations

Key milestones

Starting from the promotion kick-off meeting, I planned the overall schedule of our preparations.

Where the challenges are

The content above shows how important the CDP platform is, so the core challenge in preparing the portrait system is "how to keep the system providing stable, high-performance service under heavy traffic and high concurrency". This is mainly reflected in stability and performance.

Stability:

1. How to fail over and recover quickly when the system hits an emergency.

2. Under heavy traffic, how to control the traffic entering the system so that it stays available.

Performance: How to keep TP999 at or below 50 ms under large data volumes, with traffic approaching one million TPS.

Traffic: As the lowest layer of the golden link, the platform sees amplified traffic. The overall traffic is estimated at 980,000 TPS (98w TPS).

In fact, if you observe our daily traffic, you will find that we are conducting "big promotions" every day, and there will also be a surge in traffic every day.
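The TP999 target mentioned above (50 ms or less) means that 99.9% of requests must finish within 50 ms. Here is a minimal sketch of how such a percentile can be computed from latency samples with the nearest-rank method. The function name and the sample data are illustrative assumptions, not taken from the CDP platform's monitoring.

```python
def tp(latencies_ms, per_mille):
    """Nearest-rank percentile of latency samples.

    per_mille=999 gives TP999, per_mille=990 gives TP99.
    Integer arithmetic is used for the rank to avoid float rounding surprises.
    """
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # rank = ceil(per_mille * n / 1000), computed with integers only
    rank = (per_mille * len(ordered) + 999) // 1000
    return ordered[max(rank, 1) - 1]


# Hypothetical samples: 10,000 requests, 10 of them slow (0.1%).
samples = [8.0] * 9990 + [120.0] * 10
tp999 = tp(samples, 999)
meets_target = tp999 <= 50  # 50 ms is the target stated in the article
```

With exactly 0.1% slow requests the TP999 still lands on a fast request. One more slow request pushes it onto the slow tail, which is why the metric reacts so sharply in an incident.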

How do I "trade"

System review

This stage is mainly about reviewing the core applications that take part in the promotion. I think the most important question is: what has changed since 618? The system is constantly being iterated, so first we must make sure those changes do not harm it. Second, any new interface in scope that cannot be sized from previous promotions needs focused traffic collection. After the review, run a single-machine stress test and use the results to judge whether performance is up to standard; this also serves as a report card for the past half-year of iteration. Finally, configure JSF single-machine rate limiting (at the front single-machine interface layer) based on the stress test results. (Stability guarantee)
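To make the idea of single-machine rate limiting concrete, here is a minimal token-bucket sketch. It is not JSF's implementation or configuration, which the article does not show. The class, the helper `limit_from_stress_test`, the 5,000 TPS stress-test figure, and the 80% safety margin are all assumptions for illustration.

```python
import time


def limit_from_stress_test(max_tps, safety_pct):
    """Derive a per-machine limit as a percentage of the stress-tested maximum."""
    return max_tps * safety_pct // 100


class TokenBucket:
    """Allows up to `rate_per_sec` requests per second, with bursts up to `burst`."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill tokens for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: reject or fail fast


# Hypothetical numbers: the machine sustained 5,000 TPS in the stress test.
limit = limit_from_stress_test(5000, 80)
bucket = TokenBucket(rate_per_sec=limit, burst=limit)
```

Setting the limit below the measured maximum leaves headroom, so a machine sheds excess requests before it reaches the point where latency collapses.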

Capacity planning

My capacity planning has two parts. The first is to rerun the single-machine stress test of each application to determine the maximum load a single machine can currently carry. The second is to collect traffic estimates from upstream and downstream business parties. From these two inputs I calculate how many resources must be added for the coming promotion. Application-level rate limits are then set according to the traffic reported by the business parties, so that most of the traffic stays under control. (Stability guarantee)
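The calculation described above can be sketched as simple arithmetic: total reported peak traffic divided by the usable capacity of one machine. Only the 980,000 TPS total comes from the article. The split by business party, the 5,000 TPS single-machine result, the 60% target utilization, and the current fleet size are made-up values for illustration.

```python
def machines_needed(peak_tps, single_machine_max_tps, target_utilization_pct):
    """Machines required so that each runs at or below the target utilization."""
    usable_tps = single_machine_max_tps * target_utilization_pct // 100
    if usable_tps <= 0:
        raise ValueError("usable per-machine capacity must be positive")
    return -(-peak_tps // usable_tps)  # ceiling division with integers


# Hypothetical traffic reports collected from business parties (sums to 980,000).
reported_tps = {"payment": 400_000, "consumer_finance": 330_000, "wealth": 250_000}
peak_tps = sum(reported_tps.values())

needed = machines_needed(peak_tps, single_machine_max_tps=5000, target_utilization_pct=60)
current_machines = 250  # hypothetical current fleet size
to_add = max(0, needed - current_machines)
```

The same reported figures can then serve as per-caller rate limits, so traffic beyond what a business party declared does not consume capacity planned for others.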

Disaster recovery plan

Sort out all system downgrade plans, make disaster recovery of core nodes a one-click switch, and keep the operation manual clear so that it can be executed quickly. (Stability guarantee)

Downgrade plan

The last big trick for keeping your system from going down is downgrading. In one sentence: make the most of limited resources. For example, at peak points our system suspends the processing of crowds and labels, as well as upstream MQ jobs that are not under key protection, to free up resources for core programs and maximize the availability of core business.
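A downgrade of this kind is often implemented as a named switch that non-core work checks before it runs. The sketch below assumes an in-memory switch registry. The class, the switch names, and the job function are hypothetical and do not describe the CDP platform's actual switch system.

```python
class DowngradeSwitches:
    """In-memory registry of downgrade switches; False means running normally."""

    def __init__(self, names):
        self._downgraded = {name: False for name in names}

    def downgrade(self, name):
        self._set(name, True)

    def restore(self, name):
        self._set(name, False)

    def is_downgraded(self, name):
        return self._downgraded[name]

    def _set(self, name, value):
        if name not in self._downgraded:
            raise KeyError(f"unknown switch: {name}")
        self._downgraded[name] = value


def run_if_allowed(switches, name, job):
    """Run a non-core job unless its switch is downgraded; return None when skipped."""
    if switches.is_downgraded(name):
        return None
    return job()


# Hypothetical switch names for the non-core work mentioned in the article.
switches = DowngradeSwitches(["crowd_processing", "label_processing", "non_core_mq"])
switches.downgrade("crowd_processing")  # before the peak
skipped = run_if_allowed(switches, "crowd_processing", lambda: "processed")
switches.restore("crowd_processing")    # after the peak
resumed = run_if_allowed(switches, "crowd_processing", lambda: "processed")
```

Rejecting unknown switch names matters in practice: a typo during the peak should fail loudly instead of silently leaving a job running.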

Military-drill stress tests

At this stage, the group organizes several unified stress tests against the online clusters. During this period we rehearse every downgrade and make sure every downgrade switch works. (Stability guarantee)

Real-time monitoring

I reviewed and reconfigured the alarms on the key links of the system (telephone and buzz alarms). I also assigned a dedicated person to each core service of the system so that problems are handled more efficiently.

"Horrifying moment"

An alarm call was received at 13:51 on November 4th.

At 13:52 on November 4th, I checked SGM and saw that the TP999 of the hit interface had risen sharply. I immediately called the R2M operations engineer to find out the reason.

At 13:53 on November 4th, I operated the system's disaster recovery switch and moved the system to the backup link. My advice here: do not hesitate and do not keep hunting for the cause. React and decide quickly, and minimize the impact on online users.

As you can see from the picture below, the system returned to normal within two minutes.

Stability guarantee during a major promotion mostly comes down to emergency strategy. Because I had sorted out the system's downgrade plans in detail during the early review, defined the operation manual clearly, given the core nodes disaster recovery capability, and rehearsed the downgrades during the group's stress tests, we were able to downgrade quickly when the problem occurred and resolve the interface performance issue in the shortest time.
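The one-click switch to a backup link used in this incident can be pictured as a router that sends every query to whichever link is currently active. This is only a sketch under assumptions: the router class, the link names, and the two lookup functions are hypothetical stand-ins, since the article does not describe the actual storage links.

```python
class LinkRouter:
    """Routes queries to the active link; switching is a single operation."""

    def __init__(self, primary, backup):
        self._links = {"primary": primary, "backup": backup}
        self.active = "primary"

    def switch_to(self, name):
        if name not in self._links:
            raise KeyError(f"unknown link: {name}")
        self.active = name

    def query(self, user_id):
        return self._links[self.active](user_id)


# Hypothetical lookups standing in for the primary and backup storage links.
def primary_lookup(user_id):
    return ("primary", user_id)


def backup_lookup(user_id):
    return ("backup", user_id)


router = LinkRouter(primary_lookup, backup_lookup)
before = router.query("u1")
router.switch_to("backup")   # the disaster recovery switch
after = router.query("u1")
router.switch_to("primary")  # switch back once the primary link is healthy
```

Because callers only talk to the router, switching links needs no caller changes, which is what makes a two-minute recovery realistic once the switch has been rehearsed.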

Summary and insights

Preparing for 2023 JD.com 11.11 is a very valuable opportunity for learning and growth.

1. During the early stage of the preparations, I learned a lot of professional knowledge and exercised my teamwork and problem-solving skills. In particular, while reviewing the changes to the system architecture, I came to understand from several angles (stability, cost, operation and maintenance) why those changes were made.

2. On the day of the big promotion, everyone gathers in the conference room on standby. I was indeed very nervous, but I organized everyone to inspect all systems once more and to check the downgrade list and the rate-limiting configuration of the other systems. For me, the 10 minutes before and after 8 pm were the slowest and most stressful I have ever spent. During those 20 minutes, I notified users of downgrades, executed the system downgrades, watched the online monitoring, and restored the downgrades. I also made a table assigning each team member specific items to monitor, so that we could watch the stability of the system in real time.

3. Finally, when the system ran normally through the peak and withstood the high-concurrency pressure, I felt a huge sense of accomplishment and satisfaction. A big promotion is really tiring: it takes about two months to prepare, and during the promotion there are several days of working until midnight or even all night. But from another point of view, the big promotion is more like a technical exam and a festival for everyone involved. You cannot understand it without experiencing it, and you cannot fully understand it without experiencing it many times, so enjoy it.

One last thing

Attached is a photo with each team leader.

Author: Jingdong Technology Wang Mengen

Source: JD Cloud Developer Community. Please indicate the source when reprinting.


Origin my.oschina.net/u/4090830/blog/10148334