Web Scraping: Deduplication Basics

1. Deduplication scenarios

  URL deduplication: avoid sending duplicate requests

  Text/data deduplication: avoid storing duplicate records
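The URL case above can be sketched with a plain set that records every URL already queued; the `seen_urls`/`pending`/`enqueue` names are illustrative, not from any particular crawler framework:

```python
# Minimal URL-deduplication sketch: only queue a URL the first time it appears.
seen_urls = set()   # duplicate keys seen so far
pending = []        # request queue

def enqueue(url):
    """Queue a URL only if it has not been seen before."""
    if url not in seen_urls:
        seen_urls.add(url)
        pending.append(url)

enqueue("http://example.com/a")
enqueue("http://example.com/b")
enqueue("http://example.com/a")  # duplicate, silently skipped
print(pending)  # ['http://example.com/a', 'http://example.com/b']
```

A real crawler would normalize URLs (scheme, trailing slash, query order) before the membership test, but the seen-container pattern is the same.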

2. How data deduplication works

  What type is the data?

  What counts as a duplicate?

  For example:  data1 = ["123",123,"456","qwe","qwe"]

  Ways to deduplicate a list:

# Method 1: set (order is not preserved)
data = ["123",123,"qwe","qwe","456","123"]
ret = list(set(data))
print(ret)

# Method 2: dict keys (preserves insertion order on Python 3.7+)
data = ["123",123,"qwe","qwe","456","123"]
# {'123': None, 123: None, 'qwe': None, '456': None}
ret_dict = {}.fromkeys(data)
# dict_keys(['123', 123, 'qwe', '456'])
ret_list = ret_dict.keys()
# ['123', 123, 'qwe', '456']
print(list(ret_list))

# Method 3: loop with a membership test (preserves order)
demo_list = list()
for i in data:
    if i not in demo_list:
        demo_list.append(i)
# ['123', 123, 'qwe', '456']
print(demo_list)

  For example:  data1 = ["123",123,"456","qwe","qwe"]

  Constraint: "123" and 123 count as duplicates and should be collapsed

data = ["123",123,"qwe","qwe","456","123"]
# set comprehension: normalize every item to str, then dedup
ret_list = list({str(i) for i in data})
print(ret_list)

  Example: deduplicating objects

class Test(object):
    def __init__(self, v):
        self.v = v

t1 = Test(100)
t2 = Test(100)
t3 = Test(200)
t4 = t1
data = [t1, t2, t3, t4]
# By default a set compares objects by identity, so only t4 (the same
# object as t1) is removed; t2 stays even though its v equals t1's:
# [<__main__.Test object at 0x000000000227E208>, <__main__.Test object at 0x000000000227E2B0>, <__main__.Test object at 0x00000000026FE0F0>]
print(list(set(data)))

Requirement: drop duplicates, where Test objects with the same v count as duplicates
ret_list = list()   # v values seen so far
ret_set = set()     # one object kept per distinct v
for item in data:
    if item.v not in ret_list:
        ret_list.append(item.v)
        ret_set.add(item)
# {<__main__.Test object at 0x00000000004E9320>, <__main__.Test object at 0x00000000004E9E48>}
print(ret_set)
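An alternative worth knowing: if the class itself defines `__eq__` and `__hash__` in terms of v, a plain set() deduplicates by v directly. This is a sketch with a hypothetical `HashableTest` class; the original Test class above does not define these methods:

```python
class HashableTest(object):
    def __init__(self, v):
        self.v = v
    def __eq__(self, other):
        # two instances are equal when their v values are equal
        return isinstance(other, HashableTest) and self.v == other.v
    def __hash__(self):
        # hash must agree with __eq__ so set/dict lookups work
        return hash(self.v)

data = [HashableTest(100), HashableTest(100), HashableTest(200)]
unique = list(set(data))       # dedups by v, no manual loop needed
print(sorted(t.v for t in unique))  # [100, 200]
```

The tradeoff: equality semantics change for the whole class, everywhere it is used, so the explicit seen-list loop above is safer when dedup-by-v is only needed in one place.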

Requirement: drop duplicates, where Test objects of the same class count as duplicates
ret_list = list()   # classes seen so far
ret_set = set()     # one object kept per distinct class
for item in data:
    if item.__class__ not in ret_list:
        ret_list.append(item.__class__)
        ret_set.add(item)
# All four objects share the class Test, so only one survives:
# {<__main__.Test object at 0x00000000026A9320>}
print(ret_set)

  Deduplicating data as it is produced: keep a container of the values seen so far (the duplicate keys)

data2 = ["123",123,"456","qwe","qwe"]
ret_list = []
for data in data2:
    if data not in ret_list:
        ret_list.append(data)
print(ret_list)
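The same seen-container idea extends naturally to data that arrives one item at a time, as in a scraping pipeline. A generator sketch (the `dedup_stream` name is illustrative):

```python
def dedup_stream(items):
    """Yield each item the first time it appears; skip later duplicates."""
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item

data2 = ["123", 123, "456", "qwe", "qwe"]
print(list(dedup_stream(data2)))  # ['123', 123, '456', 'qwe']
```

Because a set gives O(1) membership tests, this scales better than the list-based `not in` check above, whose lookup cost grows linearly with the number of items already kept; the items must be hashable, though.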



Reposted from www.cnblogs.com/meloncodezhang/p/11483748.html