With just a hundred lines of code, Python takes you into the Hanfu circle

When traveling, we often see tourists posing for photos in all kinds of costumes at scenic spots, and usually don't pay much attention. Two days ago, while browsing the web, I stumbled onto a site featuring girls in Hanfu. Whether for work or for writing copy, such beautiful pictures make good material. So if we need them, let's crawl them!

Without further ado, let's walk through the image crawling in detail.

Analyze the website

The URL is as follows:

'http://www.aihanfu.com/zixun/tushang-1/'

This is the URL of the first page. By observation, the URL of the second page simply changes the trailing 1 to 2, and so on, so every page can be reached this way.
Each entry on the listing page links to an article's sub-page, i.e. the URL in its href attribute. We need to collect those links, visit each one, find the image URLs, and download them.
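As a warm-up, the listing-page URLs can be generated directly from the observed pattern. A minimal sketch (the page count of 50 is just an assumption; adjust it to however many pages the site really has):

# Minimal sketch: generate listing-page URLs from the observed pattern.
# The upper bound (50) is an assumption, not taken from the site itself.
page_urls = ['http://www.aihanfu.com/zixun/tushang-{}/'.format(i) for i in range(1, 51)]
print(page_urls[0])   # http://www.aihanfu.com/zixun/tushang-1/
print(page_urls[1])   # http://www.aihanfu.com/zixun/tushang-2/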

Sub-link acquisition

To obtain the data described above, we could use BeautifulSoup, regular expressions, or XPath. In this article, XPath is used to locate the elements: we write a function that extracts the link of each sub-page and hands it back to the main function. Note the small technique used in the for loop (the function is a generator); see the note after the code.

import os
import re

import requests
from lxml import etree


def get_menu(url, headers):
    """
    Given the URL of a listing page,
    yield the sub-page URL linked by each entry.
    params: url      listing-page URL
    params: headers  request headers
    """
    r = requests.get(url, headers=headers)
    if r.status_code == 200:
        r.encoding = r.apparent_encoding
        html = etree.HTML(r.text)
        html = etree.fromstring(etree.tostring(html))
        # Find the link of each sub-page, then yield it
        children_url = html.xpath('//div[@class="news_list"]//article/figure/a/@href')
        for child in children_url:
            yield child
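The technique mentioned above: get_menu is a generator. Instead of building a full list and returning it, it yields each sub-link one at a time, so the caller can start processing as soon as the first link is found. A minimal usage sketch (the headers dict here is just an example; the real one is defined in the main function below):

# Minimal usage sketch for get_menu (defined above).
headers = {'User-Agent': 'Mozilla/5.0'}  # example headers only
for sub_url in get_menu('http://www.aihanfu.com/zixun/tushang-1/', headers):
    print(sub_url)  # each article's sub-page URL

# Equivalently, the trailing loop inside get_menu could be written as:
#     yield from children_url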

Get title and image address

To collect as much data as possible, we grab the title and the image addresses. Of course, other projects might also want the publisher and the publish time; since we don't need them here, this article won't expand on that.
Clicking into one of the sub-page links, we find that the title sits in the article's header node; we use it as the folder name when saving the images.

The code is as follows:

def get_page(url, headers):
    """
    Given a sub-page link, get the image addresses
    and download them as a batch.
    params: url      sub-page URL
    params: headers  request headers
    """
    r = requests.get(url, headers=headers)
    if r.status_code == 200:
        r.encoding = r.apparent_encoding
        html = etree.HTML(r.text)
        html = etree.fromstring(etree.tostring(html))
        # Get the title
        title = html.xpath(r'//*[@id="main_article"]/header/h1/text()')
        # Get the image addresses
        img = html.xpath(r'//div[@class="arc_body"]//figure/img/@src')
        # Preprocess the title
        title = ''.join(title)
        title = re.sub(r'【|】', '', title)
        print(title)
        save_img(title, img, headers)
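The title becomes a directory name, which is why the code strips the 【】 brackets. Other characters that are illegal in file names (/, \, :, and so on) could also appear in a title. A minimal sketch of a stricter sanitizer; the character set below is my own assumption, not from the original article:

import re

# Hypothetical helper: strip characters that are not allowed in file/directory names.
# The character set is an assumption; extend it as needed.
def sanitize_title(title):
    return re.sub(r'[【】\\/:*?"<>|]', '', title).strip()

print(sanitize_title('【汉服图集】古风/写真'))  # 汉服图集古风写真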

Save the pictures

For every page we flip through, we save the pictures of each sub-link. Note the two checks here: the response status of the request, and whether the target directory already exists.

def save_img(title, img, headers):
    """
    Create a sub-folder named after the title
    and download every image link into it (quality/size can be adjusted if needed).
    params: title    article title
    params: img      list of image addresses
    params: headers  request headers
    """
    if not os.path.exists(title):
        os.mkdir(title)
    # Download
    for i, src in enumerate(img):  # iterate over the list of image URLs
        r = requests.get(src, headers=headers)
        if r.status_code == 200:
            with open(os.path.join(title, str(i) + '.png'), 'wb') as fw:
                fw.write(r.content)
            print('Image', i, 'of', title, 'downloaded!')
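One detail worth noting: every image is saved with a .png extension regardless of its real format. A small self-contained sketch of deriving the extension from the image URL instead; the helper name and the example.com URLs are hypothetical, and it assumes the URL path ends with a normal file extension:

import os

# Hypothetical helper: derive a file extension from an image URL.
# Falls back to '.png' when no extension can be found.
def ext_from_url(url):
    ext = os.path.splitext(url.split('?')[0])[1]
    return ext if ext else '.png'

print(ext_from_url('http://example.com/pic/001.jpg'))  # .jpg
print(ext_from_url('http://example.com/pic/001'))      # .png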

Main function

if __name__ == '__main__':
    """
    Crawl page by page
    params: None
    """
    path = '/Users/********/汉服/'
    if not os.path.exists(path):
        os.mkdir(path)
    os.chdir(path)

    # url = 'http://www.aihanfu.com/zixun/tushang-1/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
                      ' AppleWebKit/537.36 (KHTML, like Gecko)'
                      ' Chrome/81.0.4044.129 Safari/537.36'}
    for page in range(1, 50):
        url = 'http://www.aihanfu.com/zixun/tushang-{}/'.format(page)
        for child_url in get_menu(url, headers):
            get_page(child_url, headers)  # process one sub-page
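The loop above fires requests back to back. To be polite to the site, a short pause between downloads is a good idea. A minimal sketch, reusing get_menu and get_page from above; the 1-second delay is an arbitrary choice of mine, not from the original article:

import time

for page in range(1, 50):
    url = 'http://www.aihanfu.com/zixun/tushang-{}/'.format(page)
    for child_url in get_menu(url, headers):
        get_page(child_url, headers)
        time.sleep(1)  # assumed delay so we don't hammer the server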

That completes all the steps. This account has covered crawlers more than once: on the one hand, I hope everyone becomes familiar with crawling techniques; on the other hand, I believe crawling is the foundation of data analysis and data mining — without a crawler to fetch the data, how would you analyze it? To help everyone learn and grow with crawlers, readers who want the complete code can get it from the link below.

Code link

Recommended reading

For more exciting content, follow the WeChat public account "Python learning and data mining"

To make technical exchange easier, this account has opened a discussion group. If you have questions, add the assistant on WeChat: connect_we, and mention that you came from CSDN. Reposting and bookmarking are welcome. Writing is not easy — if you like the article, give it a like! Thanks.


Origin blog.csdn.net/weixin_38037405/article/details/107527731