Python E-commerce Crawler: A Hand-Holding Tutorial for Complete Beginners


Turing Python Classroom

Changsha Turing Education entered the education industry in 2001. Rooted in pan-IT vocational education and aiming to develop high-tech talent, it focuses on multi-level, personalized vocational skills training courses, cultivating mid-to-high-end talent for technology development, application, and management positions across industries, and is committed to becoming a high-quality provider of vocational education content.

01

Advantages of Python

For developing web crawlers, Python has unmatched natural advantages. Here they are explained from two angles.

1. Interfaces for fetching web pages (such as an e-commerce product-details page or API)

Compared with static programming languages (such as Java, C#, and C++), Python offers a more concise interface for fetching web documents; and compared with other dynamic scripting languages (such as Perl and shell), Python's urllib package provides a fairly complete API for accessing web documents.

In addition, crawling sometimes requires simulating browser behavior, because many websites block crude crawler requests. In such cases you need to construct requests that imitate a real user agent (simulating login, storing and setting sessions/cookies). Python has excellent third-party packages that help with these tasks (such as Requests and mechanize).
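As a taste of what this looks like in practice, here is a minimal sketch using the Requests package mentioned above (assuming it is installed via pip install requests; the User-Agent string and URL are just placeholders):

import requests

# A browser-like User-Agent string (an arbitrary example)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/50.0'}

# A Session keeps cookies across requests, similar to how a browser maintains a login
session = requests.Session()
response = session.get('http://www.baidu.com', headers=headers)
print(response.status_code)
print(response.text[:200])  # first 200 characters of the page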

2. Processing pages after crawling

Crawled web pages usually need further processing, such as filtering HTML tags and extracting text. Python's BeautifulSoup provides concise document-processing functions and can handle most of this work in very little code.

In fact, many languages and tools can do all of the above, but Python does it fastest and most cleanly.

Life is short, you need Python.

PS: Python 2.x and Python 3.x differ significantly. This article only covers crawler implementation in Python 3.x.

02

Crawler framework

URL manager: manages the set of URLs to be crawled and the set of URLs already crawled, and passes URLs to be crawled to the web page downloader.

Web page downloader (urllib): downloads the web page for a given URL, stores it as a string, and passes it to the web page parser.

Web page parser (BeautifulSoup): extracts the valuable data, stores it, and at the same time adds new URLs back to the URL manager.
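To make the division of labor concrete, here is a rough sketch of how the three components cooperate in one crawl loop (the helper functions and the 10-page limit are our own illustration, not a fixed framework):

import urllib.request
from bs4 import BeautifulSoup

def download(url):
    # Web page downloader: fetch the page and return it as a string
    return urllib.request.urlopen(url).read().decode('utf8')

def parse(html):
    # Web page parser: return (data, new_urls) extracted from the page
    soup = BeautifulSoup(html, 'html.parser')
    data = soup.title.string if soup.title else ''
    new_urls = {a['href'] for a in soup.find_all('a', href=True)
                if a['href'].startswith('http')}
    return data, new_urls

# URL manager: one set of pending URLs, one set of crawled URLs
to_crawl, crawled = {'http://www.baidu.com'}, set()
while to_crawl and len(crawled) < 10:   # stop after 10 pages in this sketch
    url = to_crawl.pop()
    crawled.add(url)
    try:
        data, new_urls = parse(download(url))
    except Exception:
        continue                        # skip pages that fail to download or parse
    print(url, data)
    to_crawl |= new_urls - crawled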

03

URL manager

Basic functions

  • Add a new URL to the set of URLs to be crawled.

  • Check whether a URL to be added is already in the container (either the set of URLs to be crawled or the set of crawled URLs).

  • Get a URL to crawl.

  • Check whether there are still URLs to be crawled.

  • Move a crawled URL from the to-be-crawled set to the crawled set.
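A minimal in-memory sketch of these functions, using two Python sets (the class and method names below are our own, not from any library):

class UrlManager:
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        # Only add a URL that is in neither set
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        # Is there still a URL waiting to be crawled?
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from the pending set to the crawled set and return it
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url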

Storage methods

1. Memory (Python memory)
   Set of URLs to be crawled: set()
   Set of crawled URLs: set()

2. Relational database (MySQL)
   urls(url, is_crawled)

3. Cache (Redis)
   Set of URLs to be crawled: set
   Set of crawled URLs: set

Because cache databases are highly performant, large Internet companies generally store URLs in a cache database. Smaller companies usually keep URLs in memory, and use a relational database when the URLs need to be stored permanently.
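For the cache option, a rough sketch with the third-party redis package (assuming a local Redis server and pip install redis; the key names are arbitrary):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

url = 'http://www.baidu.com'
# Queue the URL only if it is in neither set
if not r.sismember('crawled_urls', url) and not r.sismember('new_urls', url):
    r.sadd('new_urls', url)

# Take one URL to crawl and move it to the crawled set
next_url = r.spop('new_urls')
if next_url:
    r.sadd('crawled_urls', next_url)
    print(next_url.decode('utf8'))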

04

Web page downloader (urllib)

Downloads the web page for a given URL to the local machine and stores it as a file or a string.

Basic method

Create a new file baidu.py with the following content:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
buff = response.read()
html = buff.decode("utf8")
print(html)

Run python baidu.py on the command line to print the fetched page.
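To keep the page as a local file rather than only a string, one simple variant (the file name baidu.html is just an example):

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
html = response.read().decode("utf8")

# Save the page to a local file as well
with open('baidu.html', 'w', encoding='utf8') as f:
    f.write(html)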

Constructing a Request

The above code can be modified to:

import urllib.request

request = urllib.request.Request('http://www.baidu.com')
response = urllib.request.urlopen(request)
buff = response.read()
html = buff.decode("utf8")
print(html)

Carrying parameters

Create a new file baidu2.py with the following content:

import urllib.request
import urllib.parse

url = 'http://www.baidu.com'
values = {'name': 'voidking', 'language': 'Python'}
# URL-encode the parameters and convert them to bytes
data = urllib.parse.urlencode(values).encode(encoding='utf-8', errors='ignore')
# A browser-like User-Agent header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
request = urllib.request.Request(url=url, data=data, headers=headers, method='GET')
response = urllib.request.urlopen(request)
buff = response.read()
html = buff.decode("utf8")
print(html)
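Note that urllib sends the data argument in the request body even when method='GET'. If you want the parameters to appear as an ordinary GET query string, a common variant is to append the encoded parameters to the URL instead:

import urllib.request
import urllib.parse

url = 'http://www.baidu.com'
values = {'name': 'voidking', 'language': 'Python'}
# Append the URL-encoded parameters as a query string
full_url = url + '?' + urllib.parse.urlencode(values)
response = urllib.request.urlopen(full_url)
print(response.geturl())   # shows the final URL, including the query string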
 
 

Using Fiddler to monitor the data

To check whether the request actually carries the parameters, you can use Fiddler to inspect the traffic.
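Fiddler works as a local HTTP proxy (listening on port 8888 by default), so one way to make urllib's traffic visible to it is to add a ProxyHandler. A sketch, assuming Fiddler is running with its default settings:

import urllib.request

# Route urllib's HTTP traffic through the local Fiddler proxy (default port 8888)
proxy = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8888'})
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)

response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)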

Adding a handler (cookies)

 
 

import urllib.request
import http.cookiejar

# Create a cookie container
cj = http.cookiejar.CookieJar()
# Create an opener with a cookie handler
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
# Install the opener on urllib.request
urllib.request.install_opener(opener)
# Make the request
request = urllib.request.Request('http://www.baidu.com/')
response = urllib.request.urlopen(request)
buff = response.read()
html = buff.decode("utf8")
print(html)
print(cj)

05

Web parser (BeautifulSoup)

Extracts valuable data and new URLs from web pages.

Parser selection

To implement the parser you can use regular expressions, html.parser, BeautifulSoup, lxml, and so on. Here we choose BeautifulSoup. Regular expressions work by fuzzy matching, while the other three parse the document into a structured DOM.

Installing and testing BeautifulSoup

1. To install, run pip install beautifulsoup4 on the command line.
2. To test, run:

 
 

import bs4
print(bs4)

Basic usage

1. Create a BeautifulSoup object

 
 

import bs4
from bs4 import BeautifulSoup

# Create a BeautifulSoup object from an HTML string
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc)
print(soup.prettify())

2. Access nodes

 
 

print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)

print(soup.p)
print(soup.p['class'])

3. Searching by tag, class, or id

 
 

print(soup.find_all('a'))
print(soup.find('a'))
print(soup.find(class_='title'))
print(soup.find(id="link3"))
print(soup.find('p', class_='title'))

4. Find all <a> tag links from the document

 
 

for link in soup.find_all('a'):
    print(link.get('href'))

This prints a warning; as the message suggests, you should specify a parser explicitly when creating the BeautifulSoup object:

 
 

soup = BeautifulSoup(html_doc,'html.parser')

5. Get all the text content from the document

 
 

print(soup.get_text())

6. Matching with regular expressions

 
 

import re

link_node = soup.find('a', href=re.compile(r"til"))
print(link_node)
