Stepping on the pit: I thought it was a Redis cache, but I didn't expect it to be a Spring matter!

foreword

  I encountered a bug recently, which tossed me for several days. And this bug is not inevitable, the probability of occurrence is relatively low. At first I thought it was a problem with the old data, so I asked the test to regenerate the data and retest. Since the subsequent rounds of tests did not appear, I didn't pay much attention.

  It's a pity that the good times didn't last long, and the problem reported by the test last time appeared again. So I immediately started to investigate. According to the performance of the log, it was determined that there was a problem with the third-party service. But I'm not very sure, so let the test continue to observe.

  However, it appeared again today, and this time it was not caused by a third-party service. So I started to review the code line by line for troubleshooting. At first, I thought that the cache maintenance strategy was wrong, which led to data inconsistency between the database and redis. But after further analysis of the log, it was found that the problem was not in Redis but in Spring transactions.

scene introduction

  The business scenario is as follows: the user binds the device, needs to be displayed in the device list, and can view the device information.

  When a user binds a device, I need to add a binding record in the database. Then modify the user's policy and add the current device to the user's policy, so that you can view the device information.

  If the user binds the same device again, the original record will be unbound and a new binding record will be generated. Since the same device overwrites the binding, the user policy will not be modified.

  If it is on the device side or the mobile phone side, perform the unbinding operation. Then the server will change the state of the binding record to 解绑, and the user policy will also delete the current device. In this way, the device information cannot be seen.

code example

code structure

@Slf4j
@Service
public class DeviceUserServiceImpl {

    @Resource
    private DeviceUserServiceImpl self;
    
    /**
     * 绑定操作
     */
    @Transactional(rollbackFor = Exception.class)
    public DeviceBindVo doBind(DeviceBindBo bo, DevicePo device) {
      //  绑定逻辑
    }
    
    
    /**
     * 设备解绑
     */
    @Override
    @Transactional(rollbackFor = Exception.class)
    public void unbind(DeviceUnbindBo bo) {
        DeviceUserPo deviceUser = self.getBindInfo(bo.getDeviceId(), bo.getUserId());
        self.unbind(deviceUser, true);
    }

    /**
     * 设备解绑,公共逻辑
     */
    public void unbind(DeviceUserPo deviceUser, boolean modifyPolicy) {
        // 解绑逻辑
    }
    
    /**
     * 获取绑定记录
     */
    @Override
    @Cacheable(value = RedisPrefixConst.DEVICE_USER, key = "#deviceId")
    public DeviceUserPo get(Long deviceId) {
        // 获取绑定记录
    }
}    

binding logic

@Transactional(rollbackFor = Exception.class)
public DeviceBindVo doBind(DeviceBindBo bo, DevicePo device) {
    // 获取设备绑定信息,判断设备是否被绑定
    DeviceUserPo oldDeviceUser = self.get(bo.getDeviceId());

    boolean modifyPolicy = true;

    if (oldDeviceUser != null) {
        self.unbind(oldDeviceUser);
        modifyPolicy = false;
        log.info("设备绑定->已完覆盖绑定的解绑流程");
    }

    // 新增绑定记录
    DeviceUserPo newDeviceUser = new DeviceUserPo();
    log.info("设备绑定->已生成新的deviceUser数据: deviceUserId={}", newDeviceUser.getDeviceUserId());

    // 删除缓存
    self.deleteCache(bo.getUserId(), bo.getDeviceId());

    // 根据需求,更新策略
    if (modifyPolicy) {
        certService.modifyUserPolicy(bo.getUserId());
        log.info("设备绑定->已更新用户证书策略.");
    }

    // 返回绑定信息
    return new DeviceBindVo();
}

unbundling logic


/**
 * 设备解绑
 *
 * @param bo 参数
 */
@Override
@Transactional(rollbackFor = Exception.class)
public void unbind(DeviceUnbindBo bo) {
    DeviceUserPo deviceUser = self.getBindInfo(bo.getDeviceId(), bo.getUserId());
    self.unbind(deviceUser, true);
}


public void unbind(DeviceUserPo deviceUser, boolean modifyPolicy) {
    // 解绑
    Assert.isTrue(
            // 执行解绑SQL,
            () -> new BizException(DeviceCodeEnum.DEVICE_NOT_BOUND)
    );
    log.info("设备解绑->已更新数据库deviceUser的状态为unbind: deviceUserId={}", deviceUser.getDeviceUserId());

    // 删除缓存
    self.deleteCache(deviceUser.getUserId(), deviceUser.getDeviceId());

    // 更改策略
    if (modifyPolicy) {
        certService.modifyUserPolicy(deviceUser.getUserId());
        log.info("设备解绑->已更新用户证书策略完成: userId={}", deviceUser.getUserId());
    }
}

Get binding information

@Override
@Cacheable(value = RedisPrefixConst.DEVICE_USER, key = "#deviceId")
public DeviceUserPo get(Long deviceId) {
    // 查询SQL
}

Troubleshooting process

  First of all, our business scenario is: the user binds the device, which needs to be displayed in the device list, and can be clicked to view the device information. The bug scenario is: the device has been bound successfully and is displayed in the device list, but the device information cannot be viewed.

Wrong Conclusion: Third-Party Service Issues

  Why do you think so? First of all, if you cannot view the device information, it must be caused by a problem with the policy. But I checked the policy of this user, and it has access to the device. Then I checked the documentation of the third-party service and said that modifying the user policy may take several minutes to take effect.

   我问他们有没有试过等几分钟,再查看设备信息。他们说没有,重新绑定了一下就正常了。所以我就回复他们可能是三方服务策略生效时间延迟导致的。

  其实当时,如果他们过几分钟,再测试查看设备信息,如果能正常查看,那就说明是第三方服务策略生效延迟导致的。但是他们并不知道,策略修改以后,可能会延迟生效,就没有做这个场景的测试。

  由于出现Bug了,他们就尝试复现,重新绑定了设备。但是这个Bug不是必现的,接连好几次都成功了,并没有复现出来。所以他们将出现的异常情况告知了我。于是我就开始排查了,但是在排查过程中我忽略了一个关键点,就是他们为了复现Bug,重新测试绑定流程,并且都成功了。这也为我后面得出这个错误结论埋下了一个伏笔。

  由于我忽略了那个关键点,在排查过程中发现用户是有该设备的策略的。现在回过头来看,发现当时大脑估计是短路。因为他们在复现的过程中并没有出现失败,都是成功的,所以策略里面肯定是该设备的。由于策略里面有该设备,并且第三方服务的文档有提到策略可能会延迟生效,所以就得出了第三方服务有问题的结论。

  但是我对这个结论不是非常确定,所以让他们继续观察。并且跟他们说,如果再次出现不要做任何操作。通知我进行排查。然而今天又出现了经过排查发现是策略缺失。所以就排除是第三方服务出问题引起的了。

真实的原因

  既然排除了是策略未生效的问题,发现是策略缺失了。正常情况下,绑定成功了会修改用户的策略,那么为啥没修改呢?

  通过观察绑定代码发现,不修改用户策略只有一种情况下会产生,就是发现设备已经被绑定了,在进行覆盖绑定就不会修改策略。但是实际情况,设备已经解绑了,再进行绑定。按理来说是获取不到已经绑定的信息的。

@Transactional(rollbackFor = Exception.class)
public DeviceBindVo doBind(DeviceBindBo bo, DevicePo device) {
    // 获取设备绑定信息,判断设备是否被绑定  --> 问题点
    DeviceUserPo oldDeviceUser = self.get(bo.getDeviceId());

    boolean modifyPolicy = true;

    if (oldDeviceUser != null) {
        self.unbind(oldDeviceUser);
        modifyPolicy = false;
        log.info("设备绑定->已完覆盖绑定的解绑流程");
    }

    // 新增绑定记录
    DeviceUserPo newDeviceUser = new DeviceUserPo();
    log.info("设备绑定->已生成新的deviceUser数据: deviceUserId={}", newDeviceUser.getDeviceUserId());

    // 删除缓存
    self.deleteCache(bo.getUserId(), bo.getDeviceId());

    // 根据需求,更新策略
    if (modifyPolicy) {
        certService.modifyUserPolicy(bo.getUserId());
        log.info("设备绑定->已更新用户证书策略.");
    }

    // 返回绑定信息
    return new DeviceBindVo();
}

  那么为什么还能获取到,已经绑定的信息呢?由于get方法是加了缓存的,如果还能获取,也就说明在解绑的时候没有清除缓存。导致在绑定的时候,误以是覆盖绑定,才没有去修改策略,导致问题的出现。

@Override
@Cacheable(value = RedisPrefixConst.DEVICE_USER, key = "#deviceId")
public DeviceUserPo get(Long deviceId) {
    // 查询SQL
}

  通过观察解绑逻辑,发现是先更新数据库,再进行删除缓存。虽然在高并发下,可能在极短时间数据库已经解绑了,但是缓存还没来得及清除,获取到的还是已绑定的状态。

  但是对于我这个场景来说是不可能的出现的。由于从解绑设备,到操作设备进入绑定模式,再进行绑定。整个操作的耗时,缓存早就被清理了。并且通过查看接口日志,也发现缓存缺失是被删除了。那么为什么缓存里面还存有绑定信息呢?

  后来发现是其他线程的会获取调用get()方法,获取绑定信息做逻辑处理。由于解绑时删除了缓存,所以这个时候会从数据库里面查询最新的绑定信息并加载进缓存。按理来说这个时候,查询到的应该是解绑的状态,而不是绑定状态。

  在进行代码审查的,我看到unbind(DeviceUnbindBo bo)上有事务,unbind(DeviceUserPo deviceUser, boolean modifyPolicy)没有事务。并且是由自身的代理对象self调用的。根据Spring的事务传播性来讲,最外层开启了事务,并且通过代理对象调用内部方法,该内部方法也是具有事务的。所以说当unbind方法内的所有逻辑执行完后事务才会提交。

/**
 * 设备解绑
 *
 * @param bo 参数
 */
@Override
@Transactional(rollbackFor = Exception.class)
public void unbind(DeviceUnbindBo bo) {
    DeviceUserPo deviceUser = self.getBindInfo(bo.getDeviceId(), bo.getUserId());
    self.unbind(deviceUser, true);
}

public void unbind(DeviceUserPo deviceUser, boolean modifyPolicy) {
    // 解绑
    Assert.isTrue(
            // 执行解绑SQL,
            () -> new BizException(DeviceCodeEnum.DEVICE_NOT_BOUND)
    );
    log.info("设备解绑->已更新数据库deviceUser的状态为unbind: deviceUserId={}", deviceUser.getDeviceUserId());

    // 删除缓存
    self.deleteCache(deviceUser.getUserId(), deviceUser.getDeviceId());

    // 更改策略
    if (modifyPolicy) {
        certService.modifyUserPolicy(deviceUser.getUserId());
        log.info("设备解绑->已更新用户证书策略完成: userId={}", deviceUser.getUserId());
    }
}   

  到这里基本破案了,bug发生的过程如下:当服务端收到解绑请求时,先更改数据库的绑定状态,然后再删除缓存。在执行修改用户策略的时候,其他的线程来查询绑定信息,由于缓存已经被删除了,所以这个时候需要去数据库内查询最新的绑定信息。但是由于unbind方法具有事务,并且修改用户策略还未执行完,所事务并没有提交。导致查询到的还是旧的绑定信息,并将其写入缓存。

  这也就导致了,在重新绑定的时候,明明已经解绑了,获取到的还是绑定的状态。导致进行覆盖绑定,从而没有修改用户策略,设备绑定成功了,但无法查看设备详情。

解决方法

  解决方法非常简单,把@Transactional去掉即可。由于没有事务只要执行完更新SQL就提交了。所以避免在耗时的操作里加上事物,也就避免了上述问题的产生。

总结

  在实际开发中,我们可能一不小心就掉进了Spring事务的坑里了,所以对于事务我们需要特别小心。对于事务,并不是简单的加个@Transactional注解就行了。而是每加一个@Transactional都要认真思考,否则它可能会给你来点意外的惊喜。

结尾

  如果觉得对你有帮助,可以多多评论,多多点赞哦,也可以到我的主页看看,说不定有你喜欢的文章,也可以随手点个关注哦,谢谢。

  我是不一样的科技宅,每天进步一点点,体验不一样的生活。我们下期见!

Guess you like

Origin juejin.im/post/7240467781400641592