Crawling valid links from a network disk search with Python: implementation code

This article walks through crawling the valid links from a network disk search engine with Python. The example code is explained in detail and should be a useful reference for anyone who needs it.
Many of the links returned by the disk search have already expired, which makes finding the data you want inefficient, so the idea is to use a crawler to filter out the dead links and keep only the valid ones. It's also good practice.

The target is the site http://www.pansou.com. First search for "python" there, then open the browser's Developer Tools.

In the network requests you can find the JSON response that carries the data we want to crawl. Stripping the extra parameters from its request URL leaves the format http://106.15.195.249:8011/search_new?q=python&p=1, where q is the search term and p is the page number.
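Before writing the full crawler, you can sanity-check the endpoint with a single request. Here is a minimal probe, assuming the API is still reachable and returns the list/data/title/link structure that the crawler below relies on:

import requests
import json

# Fetch one page of results to confirm the JSON layout
url = "http://106.15.195.249:8011/search_new?q=python&p=1"
resp = requests.get(url, timeout=10)
data = json.loads(resp.content.decode("utf-8"))
# Each entry should expose a 'title' and a 'link' field
for item in data['list']['data'][:3]:
  print(item['title'], item['link'])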
The following code implements the full crawler:

import requests
import json
from multiprocessing.dummy import Pool as ThreadPool

headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
}
urls = [] # list of URLs to crawl

# Build the list of URLs
def get_urls(query):
  # Walk through 50 pages of results
  for i in range(1, 51):
    # Each URL returns JSON; q is the search term, p the page number
    url = "http://106.15.195.249:8011/search_new?q=%s&p=%d" % (query, i)
    urls.append(url)

# Fetch one page of results and check every link on it
def get_data(url):
  print("Loading, please wait...")
  # Fetch the JSON data and parse it into a dict
  resp = requests.get(url, headers=headers).content.decode("utf-8")
  resp = json.loads(resp)
  # If a page comes back empty, raise to stop the program
  if resp['list']['data'] == []:
    raise Exception("empty result page")
  # Walk through every entry on the page
  for item in resp['list']['data']:
    # The Baidu Yun link and its title
    link = item['link']
    title = item['title']
    # Fetch the link itself; a valid share page contains the text
    # "失效时间:" (expiry time), an expired one does not
    link_content = requests.get(link, headers=headers).content.decode("utf-8")
    if "失效时间:" in link_content:
      # Append title and link to the CSV file
      with open("wangpanziyuan.csv", "a+", encoding="utf-8") as file:
        file.write(title + "," + link + "\n")
  print("ok")

if __name__ == '__main__':
  # Pass the search term here
  get_urls("python")
  # Create a pool of three worker threads
  pool = ThreadPool(3)
  try:
    pool.map(get_data, urls)
  except Exception as e:
    print(e)
  pool.close()
  pool.join()
  print("Done")

Summary

That covers the implementation of crawling valid links from a network disk search with Python. I hope it helps, and thank you for reading.



Source: blog.csdn.net/haoxun12/article/details/105300852