Python Web Crawler Development Tutorial

 


Python is very popular right now, with excellent applications in web crawling, artificial intelligence, big data, and more. Today I'd like to introduce some crawler basics and the commonly used Python libraries, and I hope it will be helpful to everyone.
In fact, the idea behind a crawler is simple. It can be broken down into the following steps:

  • Initiating a network request

  • Fetching the page

  • Parsing the page to extract the data

For initiating network requests, the commonly used libraries are the standard library urllib and the third-party library requests. For parsing pages, BeautifulSoup is a common choice. The author of requests has also written another very handy library, requests-html, which combines making requests and parsing pages in one package and is very convenient for developing small crawlers. There are also dedicated crawler frameworks, the best known of which is scrapy. This article briefly introduces these libraries; a separate article will cover scrapy in detail.
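To see how those three steps fit together before we dive into each library, here is a minimal sketch using requests and BeautifulSoup (both introduced below). The URL http://httpbin.org/html is just a placeholder test page, not part of the original example.

import bs4
import requests

# Step 1: initiate the network request (httpbin.org/html is a simple static test page)
response = requests.get('http://httpbin.org/html')
# Step 2: get the page content as text
html_text = response.text
# Step 3: parse the page and extract the data we want
html = bs4.BeautifulSoup(html_text, features='html.parser')
print(html.find('h1').get_text())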

 

The standard library urllib

Let's look first at the standard library urllib. Its advantage is that it ships with Python and needs no third-party packages; its drawback is that it is a fairly low-level library and cumbersome to use. Below is a simple example of making a request with urllib. As you can see, even a simple request requires creating an opener, a Request, a ProxyHandler, and several other objects, which is rather troublesome.

import urllib.request as request
import requests

# Route both HTTP and HTTPS traffic through a local proxy
proxies = {
    'https': 'https://127.0.0.1:1080',
    'http': 'http://127.0.0.1:1080'
}

# Use a browser-like User-Agent so the request is not rejected as a bot
headers = {
    'user-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'
}


print('-------------- using urllib --------------')
url = 'http://www.google.com'
# Build an opener that routes requests through the proxy, then install it globally
opener = request.build_opener(request.ProxyHandler(proxies))
request.install_opener(opener)
# A separate Request object is needed to attach custom headers
req = request.Request(url, headers=headers)
response = request.urlopen(req)
print(response.read().decode())
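If no proxy is needed, the urllib call chain is a bit shorter, though still more verbose than requests. A minimal sketch (the httpbin.org URL is only a stand-in):

import urllib.request as request

headers = {'user-agent': 'Mozilla/5.0'}
req = request.Request('http://httpbin.org/get', headers=headers)
with request.urlopen(req) as response:
    print(response.read().decode())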

requests

requests is one of the well-known works of Kenneth Reitz; its great advantage is that it is extremely simple and easy to use. First, install requests.

pip install requests

The following simple example does the same thing as the urllib code above, but with much less code and better readability.

print('-------------- using requests --------------')
response = requests.get('https://www.google.com', headers=headers, proxies=proxies)
response.encoding = 'utf8'
print(response.text)

requests can also easily submit form data, simulating a user filling in a form. The returned Response object also carries a lot of useful information, such as the status code, headers, raw content, cookies, and so on.

from pprint import pprint

data = {
    'name': 'yitian',
    'age': 22,
    'friends': ['zhang3', 'li4']
}
response = requests.post('http://httpbin.org/post', data=data)
pprint(response.__dict__)
print(response.text)
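The attributes just mentioned can also be read directly off the Response object. A quick sketch continuing from the POST above (httpbin.org echoes the submitted form back as JSON under the 'form' key):

print(response.status_code)              # HTTP status code, e.g. 200
print(response.headers['Content-Type'])  # headers behave like a dict
print(response.cookies)                  # a RequestsCookieJar
print(response.json()['form'])           # the form data we just submitted, echoed back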

I won't go into requests any further here, because it has Chinese documentation. Although it lags a few minor versions behind the official English docs, the difference is harmless and you can refer to it with confidence.

http://cn.python-requests.org/zh_CN/latest/

beautifulsoup

With the requests library described above we can easily fetch the HTML code, but to pull the data we need out of the HTML, we also need an HTML/XML parsing library. BeautifulSoup is one such commonly used library. First, install it:

pip install beautifulsoup4

I'll use my own Jianshu homepage as an example and crawl my list of Jianshu articles. First, fetch the page content with requests.

from pprint import pprint
import bs4
import requests

headers = {
    'user-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'
}

url = 'https://www.jianshu.com/u/7753478e1554'
response = requests.get(url, headers=headers)

Then comes the BeautifulSoup code. When using BeautifulSoup, you first build the HTML tree and then look up nodes in it. BeautifulSoup offers two main ways to find nodes: the first is the find and find_all methods, the second is the select method with CSS selectors. Once you have a node, use its contents attribute to get its children; if a child node is text, you get the text value directly. Note that contents returns a list, so you need to index it with [0].

# Build the parse tree (features='lxml' requires the lxml package to be installed)
html = bs4.BeautifulSoup(response.text, features='lxml')
# There is only one note-list on the page, so limit the search and take the first match
note_list = html.find_all('ul', class_='note-list', limit=1)[0]
for a in note_list.select('li>div.content>a.title'):
    title = a.contents[0]
    link = f'https://www.jianshu.com{a["href"]}'
    print(f'《{title}》,{link}')
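If you would rather not index into contents, BeautifulSoup tags also offer get_text() for the text and dictionary-style access for attributes, so the loop above could equally be written as this small sketch:

for a in html.select('ul.note-list li > div.content > a.title'):
    title = a.get_text()   # same result as a.contents[0] when the only child is text
    link = f'https://www.jianshu.com{a.get("href")}'
    print(f'《{title}》,{link}')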

BeautifulSoup also has Chinese documentation; it too is a couple of minor versions behind, which has little practical effect.

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

requests-html

This library is the sibling of requests, another work by Kenneth Reitz. It bundles requesting and parsing web pages together. With requests alone you can only fetch the page, and then you need a parsing library such as BeautifulSoup to parse it; now a single library, requests-html, can do both.
First, install it.

pip install requests-html

Then let's see how to rewrite the example above with requests-html.

from requests_html import HTMLSession
from pprint import pprint

url = 'https://www.jianshu.com/u/7753478e1554'
headers = {
    'user-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'
}
session = HTMLSession()
r = session.get(url, headers=headers)
note_list = r.html.find('ul.note-list', first=True)
for a in note_list.find('li>div.content>a.title'):
    title = a.text
    link = f'https://www.jianshu.com{a.attrs["href"]}'
    print(f'《{title}》,{link}')
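requests-html also collects all the links on a page for you: the absolute_links property returns the full set of absolute URLs, which is handy for quick crawls. A small sketch continuing the session above:

# Every absolute URL found on the page, as a set
print(len(r.html.absolute_links))
for link in list(r.html.absolute_links)[:5]:
    print(link)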

Besides searching with CSS selectors, requests-html can also search with XPath.

for a in r.html.xpath('//ul[@class="note-list"]/li/div[@class="content"]/a[@class="title"]'):
    title = a.text
    link = f'https://www.jianshu.com{a.attrs["href"]}'
    print(f'《{title}》,{link}')

Another useful feature of requests-html is browser rendering. Some pages load their data asynchronously: crawling them directly yields only a blank page, because the data is loaded asynchronously by JavaScript running in a browser, so browser rendering is needed. Browser rendering in requests-html is very simple; just call the render function. Among its parameters, scrolldown and sleep specify the number of times to scroll the page down and the pause time between scrolls. The first time render runs, requests-html downloads a Chromium browser and then uses it to render the page.
A Jianshu personal article page is a simple example of asynchronous loading: by default only the most recent articles are shown, but by simulating scrolling through the page during browser rendering, we can get the complete list of articles.

session = HTMLSession()
r = session.get(url, headers=headers)
# The render function tells requests-html to render the page with a Chromium browser
r.html.render(scrolldown=50, sleep=0.2)
for a in r.html.xpath('//ul[@class="note-list"]/li/div[@class="content"]/a[@class="title"]'):
    title = a.text
    link = f'https://www.jianshu.com{a.attrs["href"]}'
    print(f'《{title}》,{link}')

Similarly, a Toutiao (今日头条) personal page is loaded asynchronously as well, so it also needs a call to the render function.

from requests_html import HTMLSession

headers = {
    'user-agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'
}
session = HTMLSession()
r = session.get('https://www.toutiao.com/c/user/6662330738/#mid=1620400303194116', headers=headers)
r.html.render()

for i in r.html.find('div.rbox-inner a'):
    title = i.text
    link = f'https://www.toutiao.com{i.attrs["href"]}'
    print(f'《{title}》 {link}')

Finally, here are the requests-html official website and the Chinese documentation.

https://html.python-requests.org/
https://cncert.github.io/requests-html-doc-cn/#/?id=requests-html

scrapy

The libraries described above each play their own role, and they can be combined to write crawlers. But when it comes to a professional crawler framework, we have to talk about scrapy. As a well-known crawler framework, scrapy modularizes the crawling workflow; with scrapy we can quickly build powerful crawlers.
scrapy has many concepts, however, and deserves a dedicated article, so here I'll only give a quick demonstration. First install scrapy; on Windows you also need to install pypiwin32.

pip install scrapy
pip install pypiwin32

Then create a new scrapy project and add a spider to it.

scrapy startproject myproject
cd myproject
scrapy genspider jianshu jianshu.com

Open the configuration file settings.py and set the user agent; otherwise you will run into 403 errors.

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'
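Depending on the target site you may also want to tweak a couple of other common settings in settings.py. The values below are only suggestions for this kind of small crawler, not requirements:

# Whether to respect robots.txt (scrapy's project template enables this by default)
ROBOTSTXT_OBEY = False
# Pause between requests (in seconds) to avoid hammering the site
DOWNLOAD_DELAY = 1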

Then edit the spider.

# -*- coding: utf-8 -*-
import scrapy

class JianshuSpider(scrapy.Spider):
    name = 'jianshu'
    allowed_domains = ['jianshu.com']
    start_urls = ['https://www.jianshu.com/u/7753478e1554']

    def parse(self, response):
        # Each article sits in a div.content node; extract the title text and the link
        for article in response.css('div.content'):
            yield {
                'title': article.css('a.title::text').get(),
                'link': 'https://www.jianshu.com' + article.xpath('a[@class="title"]/@href').get()
            }
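If the article list were paginated with ordinary links, the spider could also follow them to crawl the remaining pages. The snippet below is hypothetical; the selector a.next is made up purely for illustration (the real Jianshu list is actually loaded via AJAX). It would go at the end of parse:

        # Follow a hypothetical "next page" link; response.follow resolves relative URLs
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)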

Finally, run the spider.

scrapy crawl jianshu
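If you want to save the scraped items to a file instead of only printing them to the console, scrapy's feed export can do it with a single flag (the file name here is arbitrary):

scrapy crawl jianshu -o articles.json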

That's all for this article. I hope it helps everyone.


Origin www.cnblogs.com/wjw-zm/p/11789980.html