Using Python to write the simplest web crawler: the Douban Top 250



Preface

This post walks through a simple crawler written in Python that downloads the top movie posters from the Douban Top 250 list.

1. What is a crawler?

Web crawlers (also known as web spiders, web robots) are programs or scripts that automatically crawl information on the World Wide Web according to certain rules. Other less commonly used names include ants, automatic indexing, simulators, or worms.
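
In other words, a crawler programmatically fetches pages over HTTP and extracts information from them. As a minimal sketch of that idea (using only Python's standard library, with example.com as a stand-in target):

from urllib import request

# Fetch a page and show the start of its HTML.
# Note: many sites, Douban included, reject requests that lack
# browser-like headers; that issue is handled later in this post.
resp = request.urlopen('https://example.com')
html = resp.read().decode('utf-8')
print(html[:300])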

2. Usage steps

1. Import the library

from urllib import request
import re

url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': 'https://movie.douban.com/',
    'Connection': 'keep-alive'}
req = request.Request(url, headers=headers)
resp = request.urlopen(req)
# type(resp)
print(resp.status)
# status of the response (200 means OK)
con = resp.read().decode('utf-8')
# Use a regex to pull the target data out of the page source;
# with re.S, '.' also matches newlines, so the pattern can span the whole page
rs = re.findall(r'<li>(.*?)</li>', con, re.S)  # findall returns every match
li = rs[0]  # the first movie entry
# print(li)
imginf = re.findall(r'<img(.*?)" class="">', li)  # attributes of the poster <img>
info = imginf[0]
print(info)
# Split the attribute string to isolate the poster URL
result = info.split(' alt=')
print(result[1])
result1 = result[1].split(' src="')
result2 = result1[1]
print(result2)
# Download the poster and write it to disk
imgreq = request.urlopen(result2)
re_img = imgreq.read()
imgf = open(r'D:\python爬虫\1.jpg', 'wb')
imgf.write(re_img)
imgf.close()
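
The code above only extracts the first <li> entry. As a sketch of the same approach extended to every movie on the page (the combined regex and the title-based file names are my own additions, not from the original code):

from urllib import request
import re

url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36',
    'Referer': 'https://movie.douban.com/'}
req = request.Request(url, headers=headers)
con = request.urlopen(req).read().decode('utf-8')

# One regex with two capture groups grabs each title (alt) and poster URL (src)
pairs = re.findall(r'<img width="100" alt="(.*?)" src="(.*?)"', con)
for title, src in pairs:
    img = request.urlopen(src).read()
    with open(title + '.jpg', 'wb') as f:  # saves into the current directory
        f.write(img)
    print(title, src)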

2. Sample output

src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2614500649.jpg" class="">
<img width="100" alt="无间道" src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2564556863.jpg" class="">
<img width="100" alt="教父" src="https://img9.doubanio.com/view/photo/s_ratio_poster/public/p616779645.jpg" class="">
<img width="100" alt="龙猫" src="https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2540924496.jpg" class="">
<img width="100" alt="当幸福来敲门" src="https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2614359276.jpg" class="">
<img width="100" alt="怦然心动" src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p501177648.jpg" class="">
<img width="100" alt="触不可及" src="https://img9.doubanio.com/view/photo/s_ratio_poster/public/p1454261925.jpg" class="">
25

Summary

The key thing is to add the request headers:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': 'https://movie.douban.com/',
    'Connection': 'keep-alive'}
req = request.Request(url, headers=headers)

I'm still a beginner myself. Without these headers, Douban denies access and raises this error on every request: urllib.error.HTTPError: HTTP Error 418: I'm a teapot. After looking it up, I learned the request headers have to be set. Also, I think my code itself is fine: the page URL offset is supposed to increase with each page, but I only ever get entries 1-25 and the later pages never load. If anyone knows the reason, please give me some advice, thank you!
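
On that last question: Douban's Top 250 is spread across ten pages selected by a ?start= query parameter (?start=0 gives entries 1-25, ?start=25 gives 26-50, and so on), so requesting the same URL each time only ever returns the first 25. A sketch of looping over all ten pages (the loop is my suggestion, not part of the original code):

from urllib import request
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36',
    'Referer': 'https://movie.douban.com/'}

for start in range(0, 250, 25):
    # ?start= is the offset of the first movie shown on the page
    url = 'https://movie.douban.com/top250?start=' + str(start)
    req = request.Request(url, headers=headers)
    con = request.urlopen(req).read().decode('utf-8')
    titles = re.findall(r'<img width="100" alt="(.*?)" src=', con)
    print('start =', start, ':', titles)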


Origin: blog.csdn.net/weixin_45070922/article/details/109643301