Web scraping: XPath matching with the requests library

  • Use XPath to extract each teacher's image link and bio, save the image to disk, and save the bio to a text file.

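Requirements:

  1. Teacher Yang's image is <img src="pics/ygf.jpg">; save it as ygf.jpg, and name the other teachers' images the same way. Save all images in the image directory under the current directory.
  2. Save Teacher Yang's bio as "ygf.txt", and name the other teachers' files the same way. Save them in the text directory under the current directory.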
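
The naming rules above can be sketched as a small helper. This is only an illustration under stated assumptions: target_paths is a hypothetical name, and the sketch assumes every src attribute ends in a file name with an extension.

```python
import os
import posixpath


def target_paths(src):
    """Map an <img> src such as 'pics/ygf.jpg' to the two output paths.

    Hypothetical helper for illustration; assumes src ends in a file name.
    """
    # URL paths always use forward slashes, so split them with posixpath.
    image_name = posixpath.basename(src)
    stem, _ext = posixpath.splitext(image_name)
    # Local paths use the separator of the current platform.
    image_path = os.path.join('image', image_name)
    text_path = os.path.join('text', stem + '.txt')
    return image_path, text_path


print(target_paths('pics/ygf.jpg'))
```

Splitting off the extension, rather than replacing the substring jpg, keeps the stem intact even if a file name happens to contain "jpg" elsewhere.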

Code:

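import os

import requests
from lxml import etree


def save(img_url, desc):
    """Download one teacher's image and write the bio to a text file."""
    response = requests.get(img_url, timeout=10)
    response.raise_for_status()  # do not save an error page as an image
    # The requirements name the target directories image and text.
    os.makedirs('image', exist_ok=True)
    os.makedirs('text', exist_ok=True)
    # The last URL segment is the image file name, e.g. ygf.jpg.
    image_file_name = img_url.split('/')[-1]
    # Swap only the extension: ygf.jpg -> ygf.txt.
    file_name = os.path.splitext(image_file_name)[0] + '.txt'
    with open(os.path.join('image', image_file_name), 'wb') as f:
        f.write(response.content)
    # Write UTF-8 explicitly so the result does not depend on the platform default.
    with open(os.path.join('text', file_name), 'w', encoding='utf-8') as f:
        f.write(desc)
    print('Saved', image_file_name)
    print('Saved', file_name)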

def main():
    url = 'http://www.atguigu.com/teacher.shtml'
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"
                      " Chrome/65.0.3325.146 Safari/537.36"}
    response = requests.get(url, headers=headers)
    # Pass the raw bytes so the parser can use the encoding declared in the page.
    html_obj = etree.HTML(response.content)
    # Each teacher_content div holds one teacher's image and bio.
    teacher_divs = html_obj.xpath('//div[@class="teacher_content"]')

    for div in teacher_divs:
        # Query relative to this div so the image and the bio always belong together.
        src_list = div.xpath('./img/@src')
        if not src_list:
            continue  # no image in this div, so there is no file name to save under
        # Build the absolute image URL.
        img_url = 'http://www.atguigu.com/' + src_list[0]
        # Teacher bio: the text nodes directly inside the div.
        desc = "".join(div.xpath('./text()')).strip()
        print(img_url)
        save(img_url, desc)


if __name__ == '__main__':
    main()
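
One way to check the extraction logic without requesting the live page is to run it against a small inline fragment. The sketch below is an illustration, not the page's real markup: the HTML fragment, the second teacher abc, and the helper name extract_teachers are all assumptions. It reads the image src and the bio relative to each teacher_content div and resolves the src with urljoin from the standard library.

```python
from urllib.parse import urljoin

from lxml import etree

# Hypothetical markup: the two divs sit under different parents on purpose.
SAMPLE_HTML = """
<html><body>
  <div class="row">
    <div class="teacher_content"><img src="pics/ygf.jpg"> Teacher Yang bio </div>
  </div>
  <div class="row">
    <div class="teacher_content"><img src="pics/abc.jpg"> Another bio </div>
  </div>
</body></html>
"""


def extract_teachers(html, base_url):
    """Return (absolute image URL, bio) pairs, one per teacher_content div."""
    root = etree.HTML(html)
    pairs = []
    for div in root.xpath('//div[@class="teacher_content"]'):
        src_list = div.xpath('./img/@src')
        if not src_list:
            continue  # skip a div without an image
        # Text nodes directly inside the div, joined and trimmed.
        bio = "".join(div.xpath('./text()')).strip()
        pairs.append((urljoin(base_url, src_list[0]), bio))
    return pairs


for img_url, bio in extract_teachers(SAMPLE_HTML, 'http://www.atguigu.com/teacher.shtml'):
    print(img_url, '|', bio)
```

A positional predicate such as [2] counts matches among the children of one parent, not across the whole document, so in this fragment //div[@class="teacher_content"][2] matches nothing. Querying relative to each div avoids depending on how the page nests its elements.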

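Output:

For each teacher, the script prints the image URL, then one confirmation line for the saved image and one for the saved text file.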

Reposted from blog.csdn.net/a289237642/article/details/80924627