Using Python Dictionaries to Reconcile Large Data Sets

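My project has a reconciliation system, and I originally wrote a script to verify its results. Back then the two tables being reconciled held roughly 100 rows and 1,000,000 rows. For this test round I generated several million rows by hand, so both tables ended up on the order of 1,000,000 rows. When I ran my script it never produced a result, which really stumped me. I made some optimizations and want to share them here.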

Background

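The system reconciles C against G (six source tables in total). In production, C alone produces more than 6 million rows a month. Because reconciliation took too long, the team switched to a new technical approach and then handed it to me for testing. I generated 1 to 2 million rows in each of three tables; the other three tables are filled by a crawler and hold 200,000 to 2 million rows each.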

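Honestly, to check whether the reconciliation results are right, you can just sample N records, trace each one from the result table back to the source tables, and compare it with the expected result. That is enough to verify it.
But I had used a script from the start, so I did not take that route and insisted on making my script do the job.
I was asking for trouble.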
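
The sampling check described above could look roughly like the sketch below. Everything in it is hypothetical and only for illustration: the row layout (`bill_id`, `code`), the toy source data, and the simplified `expected_for` rule (which only distinguishes 1000 from 1002) are assumptions, not the real tables.

```python
import random

def sample_and_check(result_rows, expected_for, n, seed=0):
    """Sample up to n result rows and return those whose stored code
    differs from the code recomputed from the source data."""
    rng = random.Random(seed)  # fixed seed so a failing sample can be replayed
    sample = rng.sample(result_rows, min(n, len(result_rows)))
    return [row for row in sample if row["code"] != expected_for(row["bill_id"])]

# Hypothetical source data: bill ID -> (status, amount)
raw_c = {"A1": ("paid", 100), "A2": ("paid", 200)}
raw_g = {"A1": ("paid", 100), "A2": ("paid", 250)}

def expected_for(bill_id):
    # Simplified rule for the sketch: identical -> 1000, otherwise amount difference
    return 1000 if raw_c[bill_id] == raw_g[bill_id] else 1002

# Hypothetical result table; the second row carries a wrong code on purpose
result_rows = [{"bill_id": "A1", "code": 1000}, {"bill_id": "A2", "code": 1000}]
mismatches = sample_and_check(result_rows, expected_for, n=2)
print(mismatches)
```

A check like this only costs one lookup per sampled record, which is why it scales so much better than comparing every row against every other row.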

The earliest version

Originally the two tables held 100 rows and 1,000,000 rows, and my script ran smoothly with no problems.

The idea was:

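list_a is the list of records from system C; list_b is the list of records from system G.
The 4 fields of every element in the two lists are compared, and each element that meets a condition goes into the corresponding result list (6 results in total, as shown below).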

        # Record layout: [0] = ID, [1] = status, [2] = amount, [3] = date

        # 1000: all 4 fields identical on both sides
        new_list = [a for a in list_a if a in list_b]
        Log.info('Consistent-1000: {}'.format(new_list))

        # 1002: same ID and status, different amount
        new_list1 = [[a, b] for a in list_a for b in list_b if a[0] == b[0] and a[2] != b[2] and a[1] == b[1]]
        Log.info('Amounts Difference-1002: {}'.format(new_list1))

        # 1003: same ID, status and amount, different date
        new_list2 = [[a, b] for a in list_a for b in list_b if a[0] == b[0] and a[3] != b[3] and a[2] == b[2] and a[1] == b[1]]
        Log.info('Date difference-1003: {}'.format(new_list2))

        # 1001: same ID, different status
        new_list3 = [[a, b] for a in list_a for b in list_b if a[0] == b[0] and a[1] != b[1]]
        Log.info('Status Difference-1001: {}'.format(new_list3))

        # 1004: records in C whose ID never appears in G
        new_list4 = list()
        for a in list_a:
            for b in list_b:
                if a[0] == b[0]:
                    break
                if b is list_b[-1]:     # reached the end of G without a match
                    new_list4.append(a)
        Log.info('Miss G Bill-1004: {}'.format(new_list4))

        # 1005: records in G whose ID never appears in C
        new_list5 = list()
        for b in list_b:
            for a in list_a:
                if b[0] == a[0]:
                    break
                if a is list_a[-1]:     # reached the end of C without a match
                    new_list5.append(b)
        Log.info('Miss C Bill-1005: {}'.format(new_list5))
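
As a side note, the "reached the last element without a match" check for the two missing-bill cases can also be written with Python's `for ... else` clause, which runs only when the loop ends without `break`. The sketch below is one possible rewrite, not the original script; the helper name `missing_from` and the toy records are made up for illustration. It is still a nested loop, so it is no faster.

```python
def missing_from(source, other):
    """Return the records in `source` whose ID (field 0) never appears in `other`."""
    missing = []
    for rec in source:
        for candidate in other:
            if rec[0] == candidate[0]:
                break              # found a record with the same ID, stop scanning
        else:
            # The inner loop ended without break: no record shares this ID.
            # This branch also runs when `other` is empty.
            missing.append(rec)
    return missing

# Toy records: (ID, status, amount, date)
list_a = [("A1", "paid", 100, "2020-01-01"), ("A2", "paid", 200, "2020-01-02")]
list_b = [("A1", "paid", 100, "2020-01-01"), ("A3", "paid", 300, "2020-01-03")]

miss_g = missing_from(list_a, list_b)   # in C but not in G
miss_c = missing_from(list_b, list_a)   # in G but not in C
print(miss_g, miss_c)
```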

First version

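The reason for this optimization: the list comprehensions in the earliest version look concise, but they actually run at least 6 nested loops. With little data that does not matter, but when a list is 1,000,000 long, a single nested loop can mean 1,000,000 x 1,000,000 comparisons. So I optimized it into the first version, which runs 2 nested loops and produces all 6 results: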

        consistent_1000, status_1001, amount_1002 = [], [], []
        date_1003, g_miss_1004, c_miss_1005 = [], [], []

        for a in list_a:
            for b in list_b:
                if a[0] == b[0]:        # same ID: classify the pair, then stop scanning
                    if a == b:          # all 4 fields identical
                        consistent_1000.append([a, b])
                    elif a[3] != b[3] and a[2] == b[2] and a[1] == b[1]:
                        date_1003.append([a, b])
                    elif a[2] != b[2] and a[1] == b[1]:
                        amount_1002.append([a, b])
                    elif a[1] != b[1]:
                        status_1001.append([a, b])
                    break
                if b is list_b[-1]:     # reached the end of G without a match
                    g_miss_1004.append(a)

        for b in list_b:
            for a in list_a:
                if b[0] == a[0]:
                    break
                if a is list_a[-1]:     # reached the end of C without a match
                    c_miss_1005.append(b)

But even so, there was still no hope of getting a result...

[Figure: log of one test case that had still not finished running]
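
A rough back-of-the-envelope estimate shows why a nested loop over two lists of this size does not finish in any reasonable time. The comparison rate below is purely an assumed round number for illustration, not a measurement from this project.

```python
rows_c = 1_000_000
rows_g = 1_000_000

# Worst case for one nested loop: every C record is compared with every G record
worst_case_comparisons = rows_c * rows_g

# ASSUMPTION: about 10 million comparisons per second in pure Python
assumed_rate_per_second = 10_000_000

seconds = worst_case_comparisons / assumed_rate_per_second
hours = seconds / 3600
print(f"{worst_case_comparisons:.0e} comparisons, roughly {hours:.1f} hours at the assumed rate")
```

Breaking out of the inner loop on a match lowers the average, but the work still grows with the product of the two list lengths.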

Second version

Honestly, this round of optimization was really hard. I was so used to lists that I had no idea where to go next.

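After talking with colleagues and reading some material, I learned that choosing the right data structure can be the optimization in itself.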

That is how I thought of dictionaries, and then I was amazed: 4 reconciliations of 1,000,000 x 1,000,000 finished in about 3 minutes.
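
The difference comes from how membership tests work. `x in some_list` scans the list element by element, while `x in some_dict` does a hash lookup. A small timing sketch on made-up records (smaller than the real tables so that it finishes quickly) makes the gap visible; actual numbers depend on the machine.

```python
import timeit

n = 100_000
# Made-up records: (ID, status, amount, date)
records = [(f"ID{i}", "paid", i, "2020-01-01") for i in range(n)]

as_list = records
as_dict = dict.fromkeys(records, True)

probe = records[-1]   # worst case for the list: the match is the last element

list_time = timeit.timeit(lambda: probe in as_list, number=100)
dict_time = timeit.timeit(lambda: probe in as_dict, number=100)

print(f"list membership: {list_time:.4f} s for 100 lookups")
print(f"dict membership: {dict_time:.6f} s for 100 lookups")
```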

How do you use them?

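For example: turn the elements of list_a and list_b into the keys of two dictionaries, then look up each key of the first dictionary in the second. If it is found, the key exists on both sides, which means the element exists on both sides, so it belongs to the result where all 4 field values are identical.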

        dict_a_new = dict.fromkeys(list_a, True)
        dict_b_new = dict.fromkeys(list_b, True)
        # list of elements whose fields are all identical on both sides
        consistent_1000 = [i for i in dict_a_new if i in dict_b_new]
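
Here is a tiny runnable version of that idea on made-up data. It also shows two things worth knowing: the records must be hashable (tuples work, lists do not), and a set intersection gives the same elements. The set variant is an alternative I am adding for comparison, not what the script above uses.

```python
# Toy records: (ID, status, amount, date)
list_a = [("A1", "paid", 100, "2020-01-01"), ("A2", "paid", 200, "2020-01-02")]
list_b = [("A1", "paid", 100, "2020-01-01"), ("A2", "paid", 250, "2020-01-02")]

dict_a = dict.fromkeys(list_a, True)
dict_b = dict.fromkeys(list_b, True)

# Each lookup in dict_b is a hash lookup, not a scan
consistent = [rec for rec in dict_a if rec in dict_b]
print(consistent)

# Alternative: set intersection yields the same elements (order not guaranteed)
consistent_set = set(list_a) & set(list_b)

# Dictionary keys must be hashable, so list-typed records raise TypeError
try:
    dict.fromkeys([["A1", "paid", 100, "2020-01-01"]], True)
except TypeError as exc:
    unhashable_error = str(exc)
    print("lists cannot be keys:", unhashable_error)
```

If the rows come back from the database as lists, converting each one with `tuple(row)` first makes them usable as keys.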

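What about records where only some of the field values match?
My idea: compare the elements with 4 matching fields (fewer) against the elements with 3 matching fields (more). If an element is among the 3-field matches but not among the 4-field matches, its 4th field differs. The other cases follow the same pattern.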

        dict_a_new = dict.fromkeys(list_a, True)
        dict_b_new = dict.fromkeys(list_b, True)
        # list of elements whose fields are all identical on both sides
        consistent_1000 = [i for i in dict_a_new if i in dict_b_new]

        # first three fields of the fully identical elements, as a list
        new_consistent_1000_list = [(i[0], i[1], i[2]) for i in consistent_1000]
        # the same, as a dict
        new_consistent_1000_dict = dict.fromkeys(new_consistent_1000_list, True)

        # first three fields (ID, status, amount) of every element, as lists
        id_status_amount_list_a = [(i[0], i[1], i[2]) for i in list_a]
        id_status_amount_list_b = [(i[0], i[1], i[2]) for i in list_b]
        # the same, as dicts
        id_status_amount_dict_a = dict.fromkeys(id_status_amount_list_a, True)
        id_status_amount_dict_b = dict.fromkeys(id_status_amount_list_b, True)

        # elements whose first three fields match on both sides, as a list
        id_status_amount = [i for i in id_status_amount_dict_a if i in id_status_amount_dict_b]
        # the same, as a dict
        id_status_amount_dict = dict.fromkeys(id_status_amount, True)

        # first three fields match, but not all four -> the date differs
        # (note: these are 3-field tuples, not the full records)
        date_1003 = [i for i in id_status_amount_dict if i not in new_consistent_1000_dict]
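
The snippet above derives one category at a time. For comparison, here is a sketch of a different dictionary-based layout that indexes each side by ID and classifies all six categories in a single pass. It is not the script used in this post. It assumes that field 0 is an ID that is unique within each list, and the function name `reconcile` and the toy data are made up for illustration.

```python
def reconcile(list_a, list_b):
    """Classify records into the six result codes using dicts keyed by ID.
    ASSUMPTION: field 0 is an ID that is unique within each list."""
    by_id_a = {rec[0]: rec for rec in list_a}
    by_id_b = {rec[0]: rec for rec in list_b}
    result = {code: [] for code in (1000, 1001, 1002, 1003, 1004, 1005)}

    for bill_id, a in by_id_a.items():
        b = by_id_b.get(bill_id)          # one hash lookup instead of a scan
        if b is None:
            result[1004].append(a)        # missing on the G side
        elif a[1] != b[1]:
            result[1001].append((a, b))   # status differs
        elif a[2] != b[2]:
            result[1002].append((a, b))   # status equal, amount differs
        elif a[3] != b[3]:
            result[1003].append((a, b))   # only the date differs
        else:
            result[1000].append((a, b))   # all four fields identical

    # Records that exist only on the G side
    result[1005] = [b for bill_id, b in by_id_b.items() if bill_id not in by_id_a]
    return result

list_a = [
    ("A1", "paid", 100, "2020-01-01"),
    ("A2", "paid", 200, "2020-01-02"),
    ("A3", "paid", 300, "2020-01-03"),
    ("A4", "paid", 400, "2020-01-04"),
    ("A5", "paid", 500, "2020-01-05"),
]
list_b = [
    ("A1", "paid", 100, "2020-01-01"),
    ("A2", "refunded", 200, "2020-01-02"),
    ("A3", "paid", 350, "2020-01-03"),
    ("A4", "paid", 400, "2020-01-09"),
    ("A6", "paid", 600, "2020-01-06"),
]
result = reconcile(list_a, list_b)
for code, items in result.items():
    print(code, items)
```

The branch order mirrors the nested-loop logic: a status difference wins over an amount difference, which wins over a date difference. If IDs can repeat within one list, the dict comprehension keeps only the last record per ID, so this layout would need a list of records per key instead.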

Finally, here is how this optimization actually performed:
[Figure: run log of 4 test cases]

For technical discussion, feel free to add me on QQ/WeChat: 153132336 zy
Personal blog: https://blog.csdn.net/zyooooxie
