Turing Python Classroom
Changsha Turing Education entered the education industry in 2001. Rooted in pan-IT vocational education and aiming to cultivate high-tech talent, it focuses on providing multi-level, personalized vocational skills training courses, training mid-to-high-end talent for technology development, application, and management positions across industries, and is committed to becoming a high-quality vocational education content provider.
01
Advantages of Python
For developing web crawlers, Python has unparalleled natural advantages. Here, those advantages are analyzed from two aspects.
1. The interface for fetching web pages itself
Compared with static programming languages (such as Java, C#, and C++), Python has a more concise interface for fetching web documents; and compared with other dynamic scripting languages (such as Perl and shell), Python's urllib package provides a relatively complete API for accessing web documents.
In addition, crawling web pages sometimes requires simulating browser behavior, since many websites block crude crawlers. In such cases you need to simulate a user agent and construct appropriate requests (simulating user login, simulating session/cookie storage and setting). Python has excellent third-party packages that help with these tasks (such as Requests and mechanize).
2. Processing after web crawling
Crawled web pages usually need further processing, such as filtering HTML tags and extracting text. Python's BeautifulSoup provides concise document-processing functions that can complete most such processing with very little code.
In fact, many languages and tools can accomplish all of the above, but Python does it fastest and most cleanly.
Life is short, you need Python.
PS: Python 2.x and Python 3.x differ significantly; this article only discusses the crawler implementation for Python 3.x.
02
Crawler framework
URL manager: manages the set of URLs to be crawled and the set of URLs already crawled, and sends URLs to be crawled to the web page downloader.
Web page downloader (urllib): downloads the web page corresponding to a URL, stores it as a string, and sends it to the web page parser.
Web page parser (BeautifulSoup): parses out valuable data, stores it, and adds new URLs to the URL manager.
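The three components above can be wired together in a simple loop. The sketch below is only an illustration of that control flow: download_page and parse_page are hypothetical stand-ins for the urllib and BeautifulSoup steps described later, and they operate on a hard-coded page store so the example runs offline.

```python
# Minimal crawler loop wiring together URL manager, downloader, and parser.
# download_page/parse_page are hypothetical stand-ins for urllib/BeautifulSoup,
# working on a hard-coded page store so this sketch runs without a network.

PAGES = {
    'http://example.com/a': ['http://example.com/b', 'http://example.com/c'],
    'http://example.com/b': ['http://example.com/c'],
    'http://example.com/c': [],
}

def download_page(url):
    """Downloader: return the 'document' for a URL (here, just its link list)."""
    return PAGES[url]

def parse_page(doc):
    """Parser: extract new URLs (and, in a real crawler, the valuable data)."""
    return doc

def crawl(seed_url):
    to_crawl = {seed_url}   # URL manager: URLs waiting to be crawled
    crawled = set()         # URL manager: URLs already crawled
    while to_crawl:
        url = to_crawl.pop()
        crawled.add(url)                 # mark as crawled so self-links are ignored
        doc = download_page(url)         # downloader
        for new_url in parse_page(doc):  # parser
            if new_url not in crawled and new_url not in to_crawl:
                to_crawl.add(new_url)
    return crawled

print(sorted(crawl('http://example.com/a')))
# ['http://example.com/a', 'http://example.com/b', 'http://example.com/c']
```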
03
URL manager
basic functions
- Add a new URL to the set of URLs to be crawled.
- Determine whether a URL to be added is already in the container (either the set of URLs to be crawled or the set of URLs already crawled).
- Get a URL to be crawled.
- Determine whether there is still a URL to be crawled.
- Move a crawled URL from the to-be-crawled set to the crawled set.
storage method
1. Memory (Python memory)
set of URLs to be crawled: set()
set of crawled URLs: set()
2. Relational database (MySQL)
urls(url, is_crawled)
3. Cache (Redis)
set of URLs to be crawled: set
set of crawled URLs: set
Because cache databases offer high performance, large Internet companies generally store URLs in a cache database. Small companies generally store URLs in memory, and in a relational database if they need permanent storage.
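As a concrete illustration of the in-memory option, here is a minimal sketch of a URL manager backed by two Python sets, implementing the basic functions listed above. The class and method names are this sketch's own choices, not a standard library API.

```python
class UrlManager:
    """In-memory URL manager backed by two sets, as described above."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        # Only add a URL that is in neither set yet.
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move the URL from the to-be-crawled set to the crawled set.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url

manager = UrlManager()
manager.add_new_url('http://www.baidu.com')
manager.add_new_url('http://www.baidu.com')  # duplicate, ignored
print(manager.has_new_url())  # True
url = manager.get_new_url()
print(manager.has_new_url())  # False
```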
05
Web downloader (urllib)
Downloads the web page corresponding to a URL to the local machine, storing it as a file or string.
basic method
Create a new baidu.py with the following content:
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
buff = response.read()
html = buff.decode("utf8")
print(html)
Execute python baidu.py on the command line to print the fetched page.
Construct Request
The above code can be modified to:
import urllib.request
request = urllib.request.Request('http://www.baidu.com')
response = urllib.request.urlopen(request)
buff = response.read()
html = buff.decode("utf8")
print(html)
carry parameters
Create a new baidu2.py with the following content:
import urllib.request
import urllib.parse

url = 'http://www.baidu.com'
values = {'name': 'voidking', 'language': 'Python'}
data = urllib.parse.urlencode(values).encode(encoding='utf-8', errors='ignore')
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0' }
request = urllib.request.Request(url=url, data=data,headers=headers,method='GET')
response = urllib.request.urlopen(request)
buff = response.read()
html = buff.decode("utf8")
print(html)
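One caveat: urllib sends the data argument in the request body, so even with method='GET' the parameters do not end up in the URL's query string. To carry parameters in a genuine GET request, a common alternative (a sketch of my own, not the article's code) is to append the urlencoded values to the URL itself:

```python
import urllib.parse
import urllib.request

url = 'http://www.baidu.com'
values = {'name': 'voidking', 'language': 'Python'}

# Append the urlencoded parameters to the URL as a query string.
query = urllib.parse.urlencode(values)
full_url = url + '?' + query
print(full_url)  # http://www.baidu.com?name=voidking&language=Python

request = urllib.request.Request(full_url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
})
# response = urllib.request.urlopen(request)  # network call omitted in this sketch
```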
Use Fiddler to monitor data
To check whether the request actually carries the parameters, you can monitor the traffic with Fiddler.
add processor
import urllib.request
import http.cookiejar

# Create a cookie container
cj = http.cookiejar.CookieJar()
# Create an opener
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
# Install the opener on urllib.request
urllib.request.install_opener(opener)
# Make the request
request = urllib.request.Request('http://www.baidu.com/')
response = urllib.request.urlopen(request)
buff = response.read()
html = buff.decode("utf8")
print(html)
print(cj)
06
Web parser (BeautifulSoup)
Extracts valuable data and new URL lists from web pages.
parser selection
To implement the parser you can choose regular expressions, html.parser, BeautifulSoup, lxml, etc.; here we choose BeautifulSoup. Of these, regular expressions are based on fuzzy matching, while the other three are based on structured DOM parsing.
BeautifulSoup installation test
1. To install, execute pip install beautifulsoup4 on the command line.
2. Test
import bs4
print(bs4)
basic usage
1. Create a BeautifulSoup object
import bs4
from bs4 import BeautifulSoup
# Create a BeautifulSoup object from an HTML string
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>"""
soup = BeautifulSoup(html_doc)
print(soup.prettify())
2. Access nodes
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.p['class'])
3. Specify tag, class or id
print(soup.find_all('a'))
print(soup.find('a'))
print(soup.find(class_='title'))
print(soup.find(id="link3"))
print(soup.find('p',class_='title'))
4. Find all <a> tag links from the document
for link in soup.find_all('a'):
print(link.get('href'))
Running the above produces a warning; as the message suggests, simply specify a parser when creating the BeautifulSoup object:
soup = BeautifulSoup(html_doc,'html.parser')
5. Get all the text content from the document
print(soup.get_text())
6. Regular matching
import re

link_node = soup.find('a', href=re.compile(r"til"))
print(link_node)