First Experience with a Python Crawler: Scraping the Poster Images of the Douban Top 250 Movies

1. Crawl Douban Top250 source code

import requests

url = 'https://movie.douban.com/top250?start=0&filter='
data = requests.get(url)
print(data.text)
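Note that Douban has been known to reject bare requests from scripts (often with an HTTP 418 status) unless a browser-like User-Agent header is sent. As a defensive sketch, assuming that behavior, the fetch can be wrapped like this:

```python
import requests

# A browser-like User-Agent; Douban may reject requests without one.
# The exact string is an arbitrary example, not a requirement.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def fetch(url):
    """Fetch a page and return its HTML text, raising on HTTP errors."""
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.text

# usage: html_text = fetch('https://movie.douban.com/top250?start=0&filter=')
```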

2. View the structure of the data to be crawled

1. Press F12 on the page to open the developer tools, select the image element to locate its code, then right-click it, choose Edit as HTML, and copy the image's file name;

2. Then right-click on the page, choose View Page Source, press Ctrl+F to open the search bar, and paste in the image name you just copied;

3. You can see that the address stored in the src attribute of the <img> tag is the address of the picture we are looking for. Next, let's crawl these image addresses;

3. Crawl the image addresses of one page

import requests
from lxml import etree

url = 'https://movie.douban.com/top250?start=0&filter='
data = requests.get(url)
html = etree.HTML(data.text)
res = html.xpath('//img/@src')    # use an XPath expression to match the data we want
for src in res:
    print(src)
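One caveat: `//img/@src` matches every image on the page, not just the movie posters, so unrelated images (site logos, icons) can slip into the results. Assuming the Top 250 list sits inside an `<ol class="grid_view">` element (an assumption about Douban's markup that should be verified in the developer tools), the expression can be scoped to the list, as this small self-contained sketch shows:

```python
from lxml import etree

# A sample fragment mimicking the assumed Top 250 list structure.
SAMPLE = '''
<ol class="grid_view">
  <li><div class="pic"><img src="https://img.example/poster1.jpg"/></div></li>
  <li><div class="pic"><img src="https://img.example/poster2.jpg"/></div></li>
</ol>
<img src="https://img.example/logo.png"/>
'''

html = etree.HTML(SAMPLE)
all_imgs = html.xpath('//img/@src')                        # also matches the logo
posters = html.xpath('//ol[@class="grid_view"]//img/@src')  # posters only
print(len(all_imgs), len(posters))
```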

1. Now the image addresses on the first page have been crawled; next we will crawl all 250;

2. First, observe the pattern in the links of these pages:

The first page: https://movie.douban.com/top250?start=0&filter=

The second page: https://movie.douban.com/top250?start=25&filter=

The third page: https://movie.douban.com/top250?start=50&filter=

The pattern is clear: start=(the index of the first movie on that page). Now let's crawl the movie poster addresses across all ten pages;
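The pattern above is easy to capture in a small helper (the function name `page_url` is just an illustrative choice): each page shows 25 movies, so the 0-based page index n starts at n * 25.

```python
def page_url(page):
    """Return the Top 250 list URL for a 0-based page index (25 movies per page)."""
    return 'https://movie.douban.com/top250?start={}&filter='.format(page * 25)

# Print the links of all ten pages to confirm they match the pattern observed above.
for page in range(10):
    print(page_url(page))
```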

4. Crawl all image addresses

import requests
from lxml import etree

def main(start):
    url = 'https://movie.douban.com/top250?start=' + str(start) + '&filter='  # start is the i*25 value passed in below, controlling start=?
    data = requests.get(url)
    html = etree.HTML(data.text)
    res = html.xpath('//img/@src')
    for src in res:
        print(src)

if __name__ == '__main__':
    for i in range(10):    # loop i from 0 to 9, visiting all ten pages
        main(start=i*25)

Well, at this point, the addresses of all the pictures have been crawled, but our ultimate goal is to download the pictures locally. Let's go!

5. Download the pictures locally

This is the final version, which downloads all the pictures locally;


# __Gage__
import os
import requests
from lxml import etree
import urllib.request

def main(start):
    url = 'https://movie.douban.com/top250?start=' + str(start) + '&filter='
    data = requests.get(url)
    html = etree.HTML(data.text)
    res = html.xpath('//img/@src')
    os.makedirs('./movie', exist_ok=True)   # urlretrieve fails if the target directory does not exist
    for i, src in enumerate(res):
        # filename = (the directory to save into) + the name to give the image
        urllib.request.urlretrieve(src, filename="./movie/" + str(start + i) + ".jpg")
        print(str(start + i) + ".jpg")

if __name__ == '__main__':
    for i in range(10):
        main(start=i*25)




Origin blog.csdn.net/Gage__/article/details/79836852