Using Python Sets for Reconciliation

This is an original post by the author; reproduction or use without permission is strictly prohibited.
Link to this post: https://blog.csdn.net/zyooooxie/article/details/106824322

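I recently read that dicts and sets are both written with {}, and that a set can be seen as a special kind of dict: neither dict keys nor set elements can be duplicated.

That reminded me of an earlier post of mine, "Using Python Dicts to Reconcile Large Volumes of Data" (Python字典的应用之大数据量的对账).
That one was built on dicts, so could the same job be done with sets?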
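
The dict/set similarity mentioned above can be checked in a few lines. This is only a minimal sketch; the variable names are made up for illustration.

```python
# Empty braces give a dict; an empty set needs set()
empty_braces = {}
empty_set = set()
kinds = (type(empty_braces).__name__, type(empty_set).__name__)
print(kinds)               # ('dict', 'set')

# Repeated keys / elements collapse to a single entry
d = {'a': 1, 'a': 2}       # the later value wins for a repeated key
s = {'a', 'a', 'b'}
print(d, len(s))           # {'a': 2} 2

# The keys view of a dict already supports set operators
common = d.keys() & s
print(common)              # {'a'}
```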

Personal blog: https://blog.csdn.net/zyooooxie

Python's set

I rarely use sets in day-to-day work, so here is a short study summary as well.


def test_set():
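    # A set is an unordered container of unique elements. Typical uses are membership tests and removing duplicates.
    # Sets support union, intersection, difference and symmetric difference.

    # An empty set can only be created with set()
    abc = set()
    print(abc, type(abc))           # set() <class 'set'>

    # Empty braces create a dict, not a set
    abc1 = {}
    print(abc1, type(abc1))         # {} <class 'dict'>

    # Elements of a set must be hashable, so dicts, sets and lists cannot be put into a set.
    # abc.add({'1': 2222})            # TypeError: unhashable type: 'dict'
    # abc.add({'1', 2222})            # TypeError: unhashable type: 'set'
    # abc.add(['1', 2222])            # TypeError: unhashable type: 'list'

    # Duplicates are dropped when the set is built
    abc = {'apple', 'orange', 'apple', 'orange', 'banana'}
    print(abc, type(abc))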

    a = set('abccccdddabcabcabc')
    b = set('abceeefffabcabcabc')

    print(a - b)            # letters in a but not in b
    print(a.difference(b))          # same as a - b; difference() also accepts several arguments

    # print(b - a)
    # print(b.difference(a))

    print(a | b)            # letters in a or b or both
    print(a.union(b))           # union of the two sets
    print(b.union(a))

    print(a & b)            # letters in both a and b
    print(a.intersection(b))        # intersection of the two sets
    print(b.intersection(a))

    # print(a ^ b)            # letters in a or b but not both
    # print(a.symmetric_difference(b))        # elements that are in exactly one of the two sets
    # print(b.symmetric_difference(a))
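
The hashability rule matters for reconciliation because each row has to be a set element. As a small sketch (the sample rows are invented): tuples of hashable values can be stored in a set, a tuple containing a list cannot, and frozenset is the hashable variant of set.

```python
rows = [('1f', '2s'), ('1f', '2s'), ('zy', '1')]

# Tuples of hashable values are hashable, so rows can go straight into a set
unique_rows = set(rows)
print(len(unique_rows))                 # 2

# A tuple that contains a list is not hashable
try:
    {('1f', ['2s'])}
    tuple_with_list_ok = True
except TypeError:
    tuple_with_list_ok = False
print(tuple_with_list_ok)               # False

# frozenset is immutable and hashable, so it can be nested inside a set
nested = {frozenset({'a', 'b'}), frozenset({'b', 'a'})}
print(len(nested))                      # 1
```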


def test_set2():
    a = set('abccccdddabcabcabc')

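    # remove() raises KeyError if the element is missing; discard() is the safer choice
    # because it silently does nothing in that case.

    # Remove the given element if it is present
    a.discard('z')          # 'z' is not in a: no error

    # Remove the given element
    a.remove('a')           # a.remove('abc') would raise KeyError: 'abc' is not an element of a

    # --------------

    # Add an element; nothing happens if it is already in the set
    a.add('e')

    # Copy the set
    a2 = a.copy()            # Return a shallow copy of a set

    # True if the two sets have no element in common
    a.isdisjoint(a2)

    # True if a is a subset of the argument
    a.issubset(a2)

    # True if the argument is a subset of a
    a.issuperset(a2)

    # Remove and return an arbitrary element (raises KeyError on an empty set)
    a.pop()

    # Add every element of the given iterables (set, list, tuple; for a dict, its keys)
    a.update(['x', 'y'], ('z',))

    # Remove all elements
    a.clear()

    # difference()    returns the difference of the set and one or more other sets
    # difference_update()    removes from the original set the elements that also exist in the other set

    # intersection()    returns the intersection; several sets may be passed
    # intersection_update()    removes from the original set the elements that are not in the other sets

    # symmetric_difference()    returns the elements that are in exactly one of the two sets
    # symmetric_difference_update()    removes from the current set the elements shared with the other set, and inserts the other set's elements that the current set lacks

    # Sets support comprehensions too
    b = {x for x in 'abracadabra' if x not in 'abc'}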
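
To make the in-place methods listed above concrete, here is a small sketch on two invented sets. Each `*_update` method modifies the set it is called on, which is why every example starts from a copy.

```python
a = {'a', 'b', 'c', 'd'}
b = {'c', 'd', 'e'}

diff = a.copy()
diff.difference_update(b)            # same as diff -= b
print(sorted(diff))                  # ['a', 'b']

inter = a.copy()
inter.intersection_update(b)         # same as inter &= b
print(sorted(inter))                 # ['c', 'd']

sym = a.copy()
sym.symmetric_difference_update(b)   # same as sym ^= b
print(sorted(sym))                   # ['a', 'b', 'e']

# remove() vs discard() on a missing element
try:
    a.remove('zzz')
    remove_raised = False
except KeyError:
    remove_raised = True
a.discard('zzz')                     # no error
print(remove_raised)                 # True
```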


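Reconciliation code

I left that company long ago and cannot generate millions of rows, so the code below has not been verified against a large data volume; it is only a rough attempt.

[For some reconciliation concepts, see https://blog.csdn.net/zyooooxie/article/details/103896922]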

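Approach:
1. Turn both lists into sets and take the intersection; the result is the rows that exist on both sides and are completely identical.
2. Build sets from the first field only of each list and take the difference in each direction; the results are the rows the other side lacks.
3. Delete all rows in these three results from both lists.
4. Build sets from the first three fields of each list and take the intersection; these are the rows whose first three fields match and whose fourth field differs.
5. Remove those rows as well, then intersect sets of the first two fields to get the next result, and so on.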
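
The steps above can be sketched compactly as follows. This is not the code used in the post: `reconcile`, the result keys and the sample rows are all hypothetical, and unlike the listing in this post the sketch builds new lists instead of deleting from its inputs.

```python
def reconcile(list1, list2):
    """Hypothetical helper: classify rows by how many leading fields match."""
    result = {}

    # Step 1: rows identical on both sides
    same = set(list1) & set(list2)
    result['consistent'] = same

    # Step 2: transaction IDs (first field) present on one side only
    ids1 = {row[0] for row in list1}
    ids2 = {row[0] for row in list2}
    only1, only2 = ids1 - ids2, ids2 - ids1
    result['only_in_1'] = [row for row in list1 if row[0] in only1]
    result['only_in_2'] = [row for row in list2 if row[0] in only2]

    # Step 3: keep only the rows that are still unexplained
    shared = ids1 & ids2
    rest1 = [r for r in list1 if r[0] in shared and r not in same]
    rest2 = [r for r in list2 if r[0] in shared and r not in same]

    # Steps 4-5: compare ever shorter prefixes (first 3 fields, then 2, then 1)
    width = len(list1[0]) if list1 else 0
    for n in range(width - 1, 0, -1):
        common = {r[:n] for r in rest1} & {r[:n] for r in rest2}
        result[f'first_{n}_match'] = (
            [r for r in rest1 if r[:n] in common],
            [r for r in rest2 if r[:n] in common],
        )
        rest1 = [r for r in rest1 if r[:n] not in common]
        rest2 = [r for r in rest2 if r[:n] not in common]
    return result


left = [('t1', 'a', 'b', 'c'), ('t2', 'a', 'b', 'c'),
        ('t3', 'a', 'b', 'c'), ('t4', 'a', 'b', 'c')]
right = [('t1', 'a', 'b', 'c'), ('t2', 'a', 'b', 'X'),
         ('t3', 'a', 'Y', 'c'), ('t5', 'a', 'b', 'c')]
out = reconcile(left, right)
print(out['first_3_match'])   # ([('t2', 'a', 'b', 'c')], [('t2', 'a', 'b', 'X')])
```

Here `t1` is consistent, `t4` and `t5` are missing on one side, `t2` differs only in the fourth field, and `t3` differs from the third field on.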

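# user_log is not a standard-library module; Log is presumably the author's own logging helper
from user_log import Log

# gl_list1 and gl_list2 stand in for the raw data to reconcile.
# (The first line of each list literal is identical on both sides: consistent_1000.
# The fourth line holds rows that one side has and the other lacks: g_miss_1004, a_miss_1005.
# The second and third lines differ in some fields and map to the other reconciliation results.)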

gl_list1 = [('1f', '2s', '3t', '4f'), (12, '2second', '3t', '4f'),
            (1, '2s', '3t', '4f'), ('1f2', '2s', '3three', '4f'), ('1f3', '2s', '3three', '4four'),
            ('zy', '1', '2', '3'), ('zy1', '1', '2', '3'), ('zy2', '1', '2', '3'),
            ('1f42', '2s1', '3t1', '4f1'), ('1f7', '2s1', '3t1', '4f1')]

gl_list2 = [('1f', '2s', '3t', '4f'), (12, '2second', '3t', '4f'),
            (1, '2s', '3t', '4ff'), ('1f2', '2s', '3tt', '4f'), ('1f3', '2ss', '3t', '4ff'),
            ('zy', '1', '2', '33'), ('zy1', '1', '22', '3'), ('zy2', '11', '2', '3'),
            ('1f43', '2s', '3t', '4ff'), ('1f6', '2s11', '3t11', '4ff11')]


def test_set(list1: list, list2: list):
    # Log.info(list1)
    # Log.info(list2)

    # Step 1: rows that are identical on both sides
    set1 = set(list1)
    set2 = set(list2)
    res1 = set1.intersection(set2)
    Log.info('identical, consistent_1000: {}'.format(res1))

    # Step 2: compare the first field only (the transaction number)
    set41 = {i[0] for i in list1}
    set42 = {i[0] for i in list2}

    # Transaction numbers found only in list1, then the full rows that carry them
    ids4 = set41.difference(set42)
    # Log.info(ids4)
    res4 = [x for x in list1 if x[0] in ids4]
    Log.info('g_miss_1004: {}'.format(res4))

    # Transaction numbers found only in list2, then the full rows that carry them
    ids5 = set42.difference(set41)
    # Log.info(ids5)
    res5 = [x for x in list2 if x[0] in ids5]
    Log.info('a_miss_1005: {}'.format(res5))

    # Step 3: remove everything classified so far from both lists
    # (note: this mutates the lists that were passed in)
    del_res = set()
    del_res.update(res1, res4, res5)

    for r in del_res:
        for rows in (list1, list2):
            try:
                rows.remove(r)
            except ValueError:
                # r is not in this list
                pass

    # Log.info(list1)
    # Log.info(list2)

    # Step 4: rows whose first three fields match but whose fourth field differs
    set11 = {x[:3] for x in list1}
    set12 = {i[:3] for i in list2}
    res3 = set11.intersection(set12)
    # Log.info(res3)

    res3a = [x for x in list1 if x[:3] in res3]
    Log.info('date_1003: {}'.format(res3a))         # rows from list1

    for r in res3a:
        list1.remove(r)

    res3b = [x for x in list2 if x[:3] in res3]
    Log.info('date_1003: {}'.format(res3b))         # rows from list2

    for r in res3b:
        list2.remove(r)

    # Log.info(list1)
    # Log.info(list2)

    # Remaining steps omitted
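
One note on the deletion steps in the listing: each `list.remove()` call scans the list from the start, so removing many rows one by one can get slow on large inputs. A possible alternative, shown here only as a sketch with a hypothetical `drop_rows` helper and invented rows, is to filter once against a set and build a new list.

```python
def drop_rows(rows, to_drop):
    """Hypothetical helper: return rows without the ones in to_drop, order kept."""
    drop = set(to_drop)                 # constant-time membership test per row
    return [row for row in rows if row not in drop]


list1 = [('1f', '2s'), ('zy', '1'), ('1f7', '2s1')]
matched = [('1f', '2s'), ('1f7', '2s1')]

remaining = drop_rows(list1, matched)
print(remaining)                        # [('zy', '1')]
print(len(list1))                       # 3, the input list is left untouched
```

Whether this is actually faster for a given data volume has not been measured here.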

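One important point I forgot to mention: in reconciliation the transaction number (the first field) can never repeat, so no element is duplicated in either raw list.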
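
Since the whole approach relies on that uniqueness, it may be worth checking it before reconciling. A minimal sketch, with a hypothetical `duplicated_ids` helper and invented rows:

```python
from collections import Counter


def duplicated_ids(rows):
    """Hypothetical helper: transaction numbers (first field) that occur more than once."""
    counts = Counter(row[0] for row in rows)
    return {tid for tid, n in counts.items() if n > 1}


rows = [('1f', '2s'), ('zy', '1'), ('1f', '2x')]
ids = [row[0] for row in rows]

all_unique = len(set(ids)) == len(ids)   # quick yes/no check
print(all_unique)                        # False
print(duplicated_ids(rows))              # {'1f'}
```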


For technical discussion, feel free to add QQ 153132336 zy