Crawling pictures with Python

A simple Python script for crawling images.

Summary of the problems I encountered:

1. The website blocks crawlers, so the page cannot be crawled

Solution: Change the User-Agent header, which identifies the client as Python by default.

Replace it with the User-Agent your browser sends when you visit the URL. A sample value is shown below and can simply be copied.

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}
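To see what "Python by default" means in practice, the following minimal sketch compares the User-Agent that requests would send on its own with the browser-like one above. It only prepares a request and never sends it; `https://example.com` is a placeholder URL chosen for illustration, not the site from this post.

```python
import requests

# The User-Agent requests uses when no headers are passed,
# something like "python-requests/<version>".
default_ua = requests.utils.default_user_agent()

# Browser-like value copied from the text above.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
}

# Build (but do not send) a request to inspect the header that would go out.
# https://example.com is a placeholder URL (an assumption for this sketch).
session = requests.Session()
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com", headers=headers)
)
sent_ua = prepared.headers["User-Agent"]

print("default:", default_ua)
print("sent:   ", sent_ua)
```

A site that filters on the User-Agent can easily recognise the default value, which is probably why the request was rejected before the header was changed.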

2. The Chinese text in the source of the crawled page is garbled, as shown in the figure

Solution: re-encode the text with the encoding that requests assumed, then decode the resulting bytes as UTF-8:

response.text.encode(response.encoding).decode('utf-8')
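Why this round trip works can be shown without any network access. The sketch below assumes the page is really UTF-8 but was decoded as ISO-8859-1; the sample string and the variable `wrong_encoding` (standing in for `response.encoding`) are made up for illustration.

```python
# Simulate the problem: the server sends UTF-8 bytes, but the text is
# decoded with the wrong encoding (ISO-8859-1 is assumed here).
original = "明星头像"
raw_bytes = original.encode("utf-8")
garbled = raw_bytes.decode("iso-8859-1")  # roughly what response.text looks like

# The fix from the text: go back to the raw bytes, then decode as UTF-8.
wrong_encoding = "iso-8859-1"  # stands in for response.encoding
fixed = garbled.encode(wrong_encoding).decode("utf-8")

print(repr(garbled))
print(fixed)
```

With requests, setting `response.encoding = 'utf-8'` before reading `response.text` should give the same result without the round trip, provided the page really is UTF-8.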

 

3. The crawl returns nothing

The likely cause is that the regular expression does not match the page. Analyze the structure of the crawled HTML and compare it carefully against the pattern.

In my case nothing matched because my pattern contained extra spaces that are not present in the crawled data.

Note: It is best to write the pattern against the HTML you actually crawled and locate the tags there. The structure shown in the browser's page source can differ slightly from the crawled structure, for example in whitespace, and then the pattern cannot match the data.
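One way to make a pattern less sensitive to such whitespace differences is to allow optional whitespace explicitly. The two HTML snippets below are hypothetical variants of the same tag, and `img2.example.com` is a placeholder host; this is a sketch of the idea, not the pattern the script uses.

```python
import re

# Two hypothetical variants of the same tag: one as a browser might show it,
# one as it might arrive in the crawled HTML.
browser_view = '<img class="lazy" src="//img2.example.com/a.jpeg" />'
crawled_html = '<img class="lazy" src="//img2.example.com/a.jpeg"/>'

# Pattern copied from the browser view: requires a space before "/>".
strict = r'<img class="lazy" src="(.*?)" />'
# Pattern that tolerates any amount of whitespace between the parts.
tolerant = r'<img\s+class="lazy"\s+src="(.*?)"\s*/>'

strict_hits = re.findall(strict, crawled_html)
tolerant_hits = re.findall(tolerant, crawled_html)
browser_hits = re.findall(tolerant, browser_view)

print(strict_hits)    # the strict pattern finds nothing in the crawled HTML
print(tolerant_hits)
print(browser_hits)
```

Printing the crawled HTML first, as the full script does, remains the most reliable way to see what the pattern has to match.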

4. An error is raised when saving the pictures

The cause is that the image paths on the site do not include the http scheme, so Python cannot request them as valid addresses. The scheme has to be added manually.

# Add the scheme to each crawled url inside the loop
url = 'http:'+url
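As an alternative to prepending the scheme by hand, the standard library's `urljoin` can resolve such scheme-less paths against the page URL. This is a sketch of a swapped-in technique, not what the script below does; the two URLs are the ones that appear in this post.

```python
from urllib.parse import urljoin

# Page that was crawled and an image path as it appears in its HTML.
page_url = "https://www.woyaogexing.com/tupian/mingxing/2018/41351.html"
src = "//img2.woyaogexing.com/2018/07/02/c0ddb88553a24141a7b471cae28967d8!600x600.jpeg"

# urljoin takes the missing scheme from the page URL.
full_url = urljoin(page_url, src)

# A URL that is already absolute is returned unchanged.
already_absolute = urljoin(page_url, "http://img2.woyaogexing.com/a.jpeg")

print(full_url)
print(already_absolute)
```

The difference is that the image is then requested with the same scheme as the page (https here) rather than a hard-coded `http:`.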

import requests
import re
import os
import time

url = "https://www.woyaogexing.com/tupian/mingxing/2018/41351.html"
# Needed when the site blocks crawlers
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win32; x32) AppleWebKit/588.6 (KHTML, like Gecko) Chrome/15.0.4007.15 Safari/537.36'}
# Present ourselves as a browser instead of Python
response = requests.get(url, headers=headers)

# .encode(response.encoding).decode('utf-8') fixes the garbled Chinese text
html = response.text.encode(response.encoding).decode('utf-8')
print(html)
""" Parse the page """
dir_name = re.findall('<div class="pifutitle"><h1>(.*?)</h1></div>', html)[0]  # folder name, taken from the page title
print(dir_name)
if not os.path.exists(dir_name):  # create the folder if it does not exist yet
    os.mkdir(dir_name)

urls = re.findall('<img class="lazy" src="(.*?)"/>', html)  # extract the image URLs with a regular expression
# <img class="lazy" src="//img2.woyaogexing.com/2018/07/02/c0ddb88553a24141a7b471cae28967d8!600x600.jpeg"/>

""" Save the pictures """
for img_url in urls:
    time.sleep(1)  # wait one second between requests
    file_name = img_url.split('/')[-1]
    new_url = 'http:' + img_url  # the URL must include the scheme, otherwise the request fails
    img_response = requests.get(new_url, headers=headers)
    with open(os.path.join(dir_name, file_name), 'wb') as f:
        f.write(img_response.content)

 


Origin blog.csdn.net/tang242424/article/details/107486628