Another production issue just came up... still hot

Hello everyone, I'm yes.

I'm back with another production troubleshooting write-up!

Here's what happened: a colleague reported a problem to me today.


Our application needs to synchronize order information from a third party. If a user has not opened the order page for a while, the next visit automatically triggers a full pull of their orders from the third party, so that order information is updated in time and users do not operate on expired orders.

Recently this colleague noticed that every click on the order list triggered a full pull, which is clearly unreasonable and wastes resources on back-end tasks.

At first I thought it had nothing to do with me; maybe there was a bug in the front-end code (haha, I thought the same thing last time).

So I told my front-end colleague. After investigating, he told me the code was definitely fine: only users who have not synced their orders for more than an hour trigger a pull when they enter the order page again.

He sounded so sure that I believed him. No choice, then: I had to investigate myself.

This time I really did find the problem, and when I traced it back to the source, it turned out to be caused by an issue I had run into before. One thing really does lead to another!

Start investigating

I first logged in with a test account and could not reproduce what my colleague described, where every click on the order list triggers a full order pull.

Well, not a great start.

I checked with him right away and learned that it only happened in some cases. So I tracked down the individual users who were affected.

When I simulated one of those users, the full order pull task actually failed, and the error said that the accessToken had expired.

Our authorization with the third party goes through OAuth2.

In other words, the token granted by the third party had expired, so our order pull interface returned an error and the task failed.

So I suspected the token refresh code, because we have a scheduled task that uses the refreshToken ahead of time, based on the token's expiration time, to exchange for a fresh token.
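The refresh-ahead idea can be sketched roughly as follows. This is only an illustration, not the actual task: the `TokenRecord` type, the field names, and the 10 minute lead time are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical lead time: refresh tokens that expire within the next 10 minutes.
REFRESH_LEAD = timedelta(minutes=10)


@dataclass
class TokenRecord:
    """Hypothetical shape of a stored authorization record."""
    user_id: str
    access_token: str
    refresh_token: str
    expires_at: datetime


def needs_refresh(record: TokenRecord, now: datetime) -> bool:
    """True when the token expires within the lead window, or has already expired."""
    return record.expires_at - now <= REFRESH_LEAD


def select_due(records: list[TokenRecord], now: datetime) -> list[TokenRecord]:
    """Pick the records the scheduled task should refresh on this run."""
    return [r for r in records if needs_refresh(r, now)]
```

A scheduled job would call `select_due` on each run and exchange the refreshToken of every returned record for a new token.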

So in theory a token-expired error should never occur, and my first guess was that something was wrong with the refresh task, letting the token expire and the order pull task fail. The front end does not record a sync time for a failed task, so when the user enters the order page again and it sees no sync in more than an hour, it immediately triggers a full pull.
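To make that loop concrete, here is a minimal sketch of the trigger logic as I understand it. The `OrderPage` class and the `pull_all_orders` callback are hypothetical names for illustration; only the one hour threshold comes from the description above.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional

# From the description above: a pull is triggered after more than an hour without a sync.
SYNC_INTERVAL = timedelta(hours=1)


class OrderPage:
    """Hypothetical model of the order page's sync trigger."""

    def __init__(self, pull_all_orders: Callable[[], None]):
        self._pull_all_orders = pull_all_orders
        self.last_synced_at: Optional[datetime] = None

    def on_enter(self, now: datetime) -> bool:
        """Return True if entering the page triggered a full pull."""
        recently_synced = (
            self.last_synced_at is not None
            and now - self.last_synced_at <= SYNC_INTERVAL
        )
        if recently_synced:
            return False
        try:
            self._pull_all_orders()
        except Exception:
            # The pull failed, so the sync time is NOT recorded.
            # The next visit will therefore trigger another full pull.
            return True
        self.last_synced_at = now
        return True
```

With a pull that always fails (for example because the accessToken has expired), every call to `on_enter` triggers another full pull, which matches the behavior my colleague reported.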

At this point I wanted to find the colleague responsible for the token refresh task. After looking around, I found that I had written it myself...


I confirmed that the scheduled refresh task was indeed running, so the problem could only be in the token refresh request itself. I checked the log, and sure enough, there was an error!


I had seen this error before.

It is returned by the third party when we call its refresh token interface. I had no clue at the time: a code is missing, but what code?


As the code above shows, the refresh token interface only takes these two parameters, and nothing else is involved.

Moreover, when I first saw this error, I immediately took the refreshToken and tested the call locally. There was no error at all, and the accessToken came back successfully.

And after observing for several days, I found that the refresh succeeded for some users and failed for others.

The refresh interface is so simple, the error is returned by the other party, and judging by the error message it seemed to have nothing to do with me. So naturally I assumed the problem was in the other party's interface. However I looked at it, there seemed to be no room for any other explanation (remember this sentence).

So when I hit this problem before, I said I could not do anything about it and pushed the blame straight onto the third party (since the third party had plenty of problems). Who knew it would come back now.

No choice: having hit the problem again, all I could do was test locally with the user's refreshToken.

As it happened, I used to look up the refreshToken directly in the database. This time I fetched it with the company's internal tool, and that is where I spotted the catch!


See that? The refreshToken is actually empty??? I immediately checked the database and found that the data is there!!


I was stunned. So what was going on??

I immediately checked the code of the token refresh task and confirmed that my SQL does fetch the refreshToken. Since the database has a value, I could "conclude" that the refreshToken is definitely not empty when the refresh task runs!

And then, suddenly, I noticed that this fetch goes through a cache!


With that flash of insight, I immediately checked the cache and found that the refreshToken in the cache was empty. I wondered whether some bastard had deleted the refreshToken from the cache.


I quickly rejected that idea: we should not have any such requirement or implementation...

I was out of ideas. I looked at the code the internal tool calls to obtain the token and found that it calls an RPC interface. I did not have the code for that service, so I asked a veteran colleague. He vaguely remembered something and replied with this:

Well, I had caught it. I went straight to confront the colleague who made the change. Who knew he would reply with just three words:

I shot one straight back:

And with that, the case was solved...

This colleague's reasoning was as follows: fetching a token normally has no need for the refreshToken, so, following the rule of selecting only what you need, he left the refreshToken out. As a result, the data he fetched has no refreshToken value.

The authorization service, on the other hand, was written early on. At that time, service A, which this colleague is responsible for, had not yet been split out, so the authorization service read and wrote tokens by operating on the database itself. That is why I was so sure that my code gets the refreshToken from the database, and it never occurred to me that the refreshToken could be empty.

The problem is that the two share a cache key. To keep things lean, service A does not put the refreshToken into the cache when it loads a user's authorization information. When the authorization service later loads the same user's authorization information, it hits the cache and takes the value straight from it. The cached value has no refreshToken, so the refreshToken passed to the third-party refresh token interface is empty!
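The interaction can be reproduced in a few lines. This is a toy model under stated assumptions, not the real services: the in-memory `DATABASE` and `CACHE` dictionaries, the key format, and the function names are all made up for illustration.

```python
# Toy stand-ins for the real database and the shared cache (hypothetical data).
DATABASE = {"user-1": {"access_token": "at-1", "refresh_token": "rt-1"}}
CACHE: dict[str, dict] = {}


def cache_key(user_id: str) -> str:
    # Both services build the SAME key, which is the root of the problem.
    return f"auth:{user_id}"


def service_a_get_auth(user_id: str) -> dict:
    """Service A: on a cache miss, caches only the fields it needs (no refresh_token)."""
    key = cache_key(user_id)
    if key not in CACHE:
        row = DATABASE[user_id]
        CACHE[key] = {"access_token": row["access_token"]}
    return CACHE[key]


def auth_service_get_auth(user_id: str) -> dict:
    """Authorization service: on a cache miss, caches the full row."""
    key = cache_key(user_id)
    if key not in CACHE:
        CACHE[key] = dict(DATABASE[user_id])
    return CACHE[key]


def refresh_token_for_refresh_call(user_id: str):
    """What the refresh task would send to the third party."""
    return auth_service_get_auth(user_id).get("refresh_token")
```

If `service_a_get_auth("user-1")` runs first, `refresh_token_for_refresh_call("user-1")` returns `None` even though the database holds a value. If the authorization service fills the cache first, the same call returns the stored refreshToken, which is why only some users were affected.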

So the third party returned an error:

At this point I finally understood what the missing code meant... I just want to say: wouldn't it be better for the error message to say that the refreshToken parameter is empty? Telling me a code is missing, how am I supposed to know what code that is!

As for the users whose cache was filled by the authorization service before service A, their refresh works normally, because the authorization service does put the refreshToken into the cache.

So, after the investigation, the final fix is for service A to put the refreshToken into the cache as well.

Finally

As you can see, this investigation did not involve any advanced technology. It was a mistake caused by several parties interacting without thinking things through. In fact, most errors in production come down to small details, such as a wrong parameter configuration or one extra condition check.

Let's summarize this experience:

  • Think about the correctness of the cache whenever you fetch data, not just the database. Don't forget the cache.
  • Converge operations into the service that owns them. Keep service boundaries clear and independent, and try not to reimplement another service's functions internally. Then, when requirements change, you avoid sprawling changes and missed changes, and problems like the one above do not occur. Unified constraints are the most comfortable.
  • Make error messages clear. If the error above had said that the refreshToken parameter was empty instead of that a code was missing, I might have finished the investigation the first time I saw it, instead of waiting until now. (Trust matters too: when a service produces many errors, you gradually stop trusting it.)
  • Global awareness is key. Even if you are in charge of only one service, take the chance to learn about other people's services, especially your own upstream and downstream. Then when something goes wrong, you can scan the whole picture in your head and quickly locate where the problem might be. That is the difference between an expert and an ordinary engineer (what you cannot solve, someone else solves in two minutes).
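On the point about clear error messages, one option on our own side is to validate the parameter before calling the third party at all. The sketch below is an assumption of mine, not the actual fix: the function and exception names are hypothetical, and the field names follow the generic OAuth2 refresh grant, not necessarily this third party's interface.

```python
from typing import Optional


class MissingRefreshTokenError(ValueError):
    """Raised locally, before any request is sent to the third party."""


def build_refresh_request(user_id: str, refresh_token: Optional[str]) -> dict:
    """Build the refresh request body, failing fast with a clear message."""
    if not refresh_token:
        # Name the real cause instead of letting the third party
        # answer with an unrelated-looking error.
        raise MissingRefreshTokenError(
            f"refresh_token is empty for user {user_id}; "
            "check the cached authorization data"
        )
    return {"grant_type": "refresh_token", "refresh_token": refresh_token}
```

A guard like this would have pointed at the empty refreshToken the first time the refresh failed, regardless of what the third party's error message says.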

That's about it. If you need to, you can also use this experience in interviews, hahaha, be my guest!

I'm yes, from a little bit to a whole lot. Let's look forward to the next production investigation together!

Origin blog.csdn.net/yessimida/article/details/121662988