[Python Crawler Introduction] Downloading images and exporting scraped information as a table


Download images from the website


Downloading images from a website over HTTP involves three steps: fetch the source code of the page; extract the hyperlinks of the images; download each image to a local folder according to its URL.


urllib's request.urlopen( ) opens a URL and returns a response object that behaves much like a file: its contents can be read by calling read( ). The following code prints the HTML of the fetched page:
from urllib import request
response = request.urlopen("http://fanyi.baidu.com")
html = response.read()
html = html.decode("utf-8")  # decode() turns the page bytes into text; otherwise the output is garbled
print(html)

To check a page's encoding, view the page source in a browser or open the developer tools (the F12 shortcut) and look for the charset attribute near the top of the head tag (inside one of the meta tags); that tells you which encoding the page uses.
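
If you would rather detect the encoding in code, here is a minimal sketch, assuming the page declares its charset in a meta tag and reusing the same Baidu Translate URL as above:

import re
from urllib import request

# Read the raw bytes first, then look for a charset declaration in the markup
# before decoding; fall back to utf-8 if no declaration is found.
raw = request.urlopen("http://fanyi.baidu.com").read()
match = re.search(rb'charset=["\']?([\w-]+)', raw, re.IGNORECASE)
encoding = match.group(1).decode("ascii") if match else "utf-8"
print(encoding)
print(raw.decode(encoding, "ignore")[:200])  # preview the decoded page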


There are two common ways to save a web resource locally: one is the request.urlretrieve( ) function, the other is to write the downloaded bytes yourself with Python's file write( ) function. Below is sample code that crawls the images on a website and downloads them locally:

'''
    A first, simple image-scraping program, using Python 3.x and the urllib library
'''

import urllib.request
import re
import os


def getHtmlCode(url):  # given a URL, return the HTML source of the page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36'
    }

    # Request() attaches the headers to the URL so the request looks like a normal browser visit
    url1 = urllib.request.Request(url, headers=headers)
    page = urllib.request.urlopen(url1).read()  # fetch the page source as bytes
    page = page.decode('GB2312', 'ignore')  # decode the bytes into a string

    return page


def getImg(page):  # given the HTML source, extract the img links and save the images locally
    # the pattern only matches https links; adjust it if the site serves images over http
    imgList = re.findall(r'(https:[^\s]*?(jpg|png|gif))', page)
    x = 100
    if not os.path.exists("E:/img1"):
        os.mkdir("E:/img1")

    for imgUrl in imgList:
        try:
            print('Downloading: %s' % imgUrl[0])
            urllib.request.urlretrieve(
                imgUrl[0], 'E:/img1/%d.%s' % (x, imgUrl[1]))
            x += 1
        except Exception:
            continue


if __name__ == '__main__':
    url = 'https://www.book123.info/list?key=python'  # search results page for "python" on the Wuming Books site
    page = getHtmlCode(url)
    # Alternative without custom headers:
    # page = urllib.request.urlopen(url).read()
    # page = page.decode('GB2312', 'ignore')
    getImg(page)
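
The sample above saves files with request.urlretrieve( ). As a minimal sketch of the second approach mentioned earlier, writing the downloaded bytes with open( )/write( ), a single image could be saved like this (the URL and save path below are placeholders):

import urllib.request

def save_image(img_url, save_path):
    # Fetch the image bytes and write them to disk with the file API
    # instead of urllib.request.urlretrieve().
    req = urllib.request.Request(img_url, headers={'User-Agent': 'Mozilla/5.0'})
    data = urllib.request.urlopen(req).read()
    with open(save_path, 'wb') as f:
        f.write(data)

# Placeholder example:
# save_image('https://www.example.com/cover.jpg', 'E:/img1/cover.jpg')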


In fact, mainstream websites already deploy anti-crawler measures, which is why a small non-profit site, Wuming Books (book123.info), is chosen here for practice. Crawling mainstream sites may require further adjustments, such as using cookies to pass account-login verification.
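
As a rough illustration only (not used in this project), a logged-in session's cookie can be attached to the request headers. The URL and the Cookie value below are placeholders; the real string would be copied from the browser's developer tools after logging in:

import urllib.request

headers = {
    'User-Agent': 'Mozilla/5.0',
    # Placeholder: paste the real Cookie string copied from the browser after login
    'Cookie': 'sessionid=PLACEHOLDER; token=PLACEHOLDER',
}
req = urllib.request.Request('https://example.com/protected-page', headers=headers)
html = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')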



Extract website content and record it as a table


Since regular expressions take some time to master, the third-party library BeautifulSoup can be used instead: it lets you extract page content by tag name, which makes the whole operation much more intuitive.

BeautifulSoup is a Python library for processing HTML/XML. It quickly converts a fetched page into a parse tree (much like the DOM) and provides simple, Python-style methods to find, navigate, and modify nodes in that tree. Documentation is also available in Chinese, which makes it friendly to learn.


A BeautifulSoup object can be created from a string of HTML, a local HTML file, or a page fetched from a URL. Creating one from a URL works as follows:

from urllib import request
from bs4 import BeautifulSoup

response = request.urlopen("http://www.baidu.com")
html = response.read()
html = html.decode("utf-8")
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())  # pretty-print the parsed content


Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All objects fall into the following four types:

  • Tag
  • NavigableString
  • BeautifulSoup
  • Comment

A Tag corresponds to a tag in the HTML:

print(soup.title)
print(soup.head)

A NavigableString holds the text inside a tag; it is retrieved simply with .string:

print(soup.title.string)

BeautifulSoup represents the content of the document as a whole. Most of the time it can be treated as a Tag object, just a special one. The following code prints its type, name, and attributes:

print(type(soup))
print(soup.name)
print(soup.attrs)

A Comment object is a special kind of NavigableString whose content excludes the comment markers; if it is not handled properly, it can cause unexpected trouble when processing text.
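
For example, in this minimal sketch on a hand-written snippet, the comment comes back as a Comment object whose text no longer contains the <!-- --> markers:

from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<b><!-- this is a comment --></b>", "html.parser")
node = soup.b.string
print(type(node))                 # <class 'bs4.element.Comment'>
print(node)                       # ' this is a comment ' (markers stripped)
print(isinstance(node, Comment))  # True, so comments can be filtered out if needed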


A note about .string: if a tag contains no child tags, .string returns the tag's own text. If a tag contains exactly one child tag, .string returns the text of that innermost tag. If a tag contains multiple child tags, .string cannot tell which child is meant, and it returns None.
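
A small sketch on a hand-written snippet illustrates the three cases:

from bs4 import BeautifulSoup

html = "<div id='a'><p>only text</p></div><div id='b'><p>x</p><p>y</p></div>"
soup = BeautifulSoup(html, "html.parser")

print(soup.p.string)             # 'only text': the <p> has no child tags
print(soup.find(id="a").string)  # 'only text': exactly one child tag, so its text is returned
print(soup.find(id="b").string)  # None: several child tags, .string cannot choose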


To avoid having an IP blocked during collection, crawlers often route requests through proxies; a minimal sketch of passing proxies to the requests library appears right below. The full example that follows then crawls Dangdang's online book listings and writes the information into a CSV table:
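
The proxy addresses in this sketch are placeholders and are not used in the example that follows; substitute a working proxy of your own:

import requests

# Placeholder proxy addresses; replace with a real proxy before use.
proxies = {
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080',
}
r = requests.get('http://search.dangdang.com/?key=Python&act=input',
                 proxies=proxies, timeout=30)
print(r.status_code)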

from bs4 import BeautifulSoup
import requests
import csv


def get_all_books():  # collect the detail-page URL of every matching book
    '''
        Collect the links of all qualifying books on this results page
    '''
    url = 'http://search.dangdang.com/?key=Python&act=input'
    book_list = []
    r = requests.get(url, timeout=30)
    soup = BeautifulSoup(r.text, 'lxml')

    book_ul = soup.find_all('ul', {'class': 'bigimg'})
    book_ps = book_ul[0].find_all('p', {'class': 'name', 'name': 'title'})
    for book_p in book_ps:
        book_a = book_p.find('a')
        book_url = 'http:' + book_a.get('href')  # detail-page link; the 'http:' prefix must be added
        book_title = book_a.get('title')  # book title
        book_list.append(book_url)

    return book_list


def get_information(book_url):
    '''
        Collect the information of a single book
    '''
    print(book_url)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20200201 Firefox/31.0'}
    r = requests.get(book_url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    book_info = []

    # book title
    div_name = soup.find('div', {'class': 'name_info', 'ddt-area': '001'})
    h1 = div_name.find('h1')
    book_name = h1.get('title')
    book_info.append(book_name)

    # author (dd_name '作者')
    div_author = soup.find('div', {'class': 'messbox_info'})
    span_author = div_author.find('span', {'class': 't1', 'dd_name': '作者'})
    book_author = span_author.text.strip()[3:]
    book_info.append(book_author)

    # publisher (dd_name '出版社')
    div_press = soup.find('div', {'class': 'messbox_info'})
    span_press = div_press.find('span', {'class': 't1', 'dd_name': '出版社'})
    book_press = span_press.text.strip()[4:]
    book_info.append(book_press)

    # current price
    div_price = soup.find('div', {'class': 'price_d'})
    book_price = div_price.find('p', {'id': 'dd-price'}).text.strip()
    book_info.append(book_price)

    return book_info


def main():
    header = ['书籍名称', '作者', '出版社', '当前价钱']  # book title, author, publisher, current price
    # newline='' keeps the csv module from writing blank rows on Windows
    with open('Python_book_info.csv', 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        books = get_all_books()
        for i, book in enumerate(books):
            if i % 10 == 0:
                print('Fetched {} of {} records'.format(i, len(books)))
            l = get_information(book)
            writer.writerow(l)


if __name__ == '__main__':
    main()


In the end the crawler did run into anti-crawling measures, but it had already collected the information, which shows the practice was a success.


  I spent part of the summer vacation getting started with crawlers and completed some small projects with reference to Xia Minjie's detailed hands-on Python crawler book. It was good fun, so I am sharing a few selected parts here.
