The actual combat of web crawlers

Guidance
1 Getting started with the Re (regular expression) library
Example 2 "Taobao Commodity Price Comparison Targeted Crawler" (requests+re)
Example 3 "Stock Data Oriented Crawler" (requests+bs4+re)

Guidance

Insert picture description here

1 Getting started with the Re (regular expression) library

1.1 The concept of regular expressions

Insert picture description here
It is too cumbersome to list all of them, so using regular expressions

can express a group of strings.
Example 1:

Example 2:

The features after compilation correspond to a group of strings
. The regular expression before compilation is just one that conforms to the regular expression syntax. Single string

1.2 Regular expression syntax

Insert picture description here

".": Any character that appears on the character table

The first one: does not consider the value range and space of each paragraph, only considers the "." between them to separate the
second: every string that appears is 0 or 1 or 2 or 3
above 2 are not precise enough
Insert picture description here

1.3 Basic use of Re library

Insert picture description here

1.3.1 Representation type of regular expression-native string type

Insert picture description here
The "in the native string is not expressed as an escape character,

so:

1.3.2 Main functions of Re library

Insert picture description here
re.search(): search for the same place as the regular expression in the string
re.match(): match only at the given position
re.findall(): find all the same as the regular expression in the string String of strings

re.search()

Insert picture description here

In the regular expression,'.' matches any character except'\n'. For
example: Chinese postal code, matching "BIT 10081"

, what's the use of writing'BIT', return from re.match(), this' BIT' has no specific meaning, just to show that search does not necessarily match from the beginning
Insert picture description here

re.match()

Insert picture description here

It is found that no match is found.

Because match matches from the beginning, it will not match'BIT 100081'.
If you don't judge whether it matches, an error will be reported.

re.findall()

Insert picture description here

re.split()

Insert picture description here

The second one above adds maxsplit=1 to indicate that it only matches once

re.finditer ()

Insert picture description here

Ability to iteratively return each result and process each result separately

re.compiler()

Insert picture description here

1.4 match object of Re library

Insert picture description here

1.4.1 Properties of the Match object

Insert picture description here

1.4.2 Methods of Match Object

Insert picture description here

The output is marked with compile, which means that only after compile will be

the start and end position of the regular expression search

Insert picture description here
match returns the result of the first match. If you want to return every time, you must use finditer() to

match

the binary relationship between the start and end positions of the string.

1.5 Greedy matching and minimum matching of the Re library

Insert picture description here
PY.*N: start with PY, end with N, and a string of any characters in the middle

1.6 Summary

Insert picture description here

Example 2 "Taobao Commodity Price Comparison Targeted Crawler" (requests+re)

1 "Taobao commodity price comparison directional crawler" example introduction

Insert picture description here

This example does not harass Taobao’s servers

2 "Taobao commodity price comparison directional crawler" example compilation

Overall structure

Insert picture description here

depth=2: crawl 2 pages
try:
…
except:
continue
crawling a page after an error, continue to crawl the next page

getHTMLText()

Insert picture description here

parsePage ()

Because the price of Taobao uses a scripting language, it can be done only by search, so there is no need to use bs, here is just a regular rule,
but now the search on Taobao will pop up the login interface, this crawler may not be suitable for Now the
Taobao source code in the video.
Its price is:

"View_price": 139.9
Its name is in:
“raw_Title”:xxxxx

Insert picture description here

The first slash is an escaped
regular expression: "view_price":"[\d.]* "

eval(): remove the double quotes and
take the following ":"

printGoodsList()

Insert picture description here

Indicates that the length of the first element of the output is 4, the length of the second element is 8, and the length of the third element is 16

3 summary

Insert picture description here
Source code

import requests
import re
 
def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
     
def parsePage(ilt, html):
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"',html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"',html)
        for i in range(len(plt)):
            price = eval(plt[i].split(':')[1])
            title = eval(tlt[i].split(':')[1])
            ilt.append([price , title])
    except:
        print("")
 
def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("序号", "价格", "商品名称"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))
         
def main():
    goods = '书包'
    depth = 3
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44*i)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodsList(infoList)
     
main()

It doesn't seem to be good anymore, I haven't climbed anything
Insert picture description here

Example 3 "Stock Data Oriented Crawler" (requests+bs4+re)

1 Introduction to the "Stock Data Targeted Crawler" example

Insert picture description here

Insert picture description here
Looking at the source code, you can see that Baidu stock is more suitable.
However, Baidu stock cannot find many stock information on one page.

Therefore, open the Oriental Fortune website.

Insert picture description here

Insert picture description here
Refer to the storage method of the page to perform related calibration for each information source and information value. Key-value pairs can be used. The dictionary type is the data type that maintains the key-value pairs to save the information of each stock, and then use the dictionary to compare all stocks. Integration of information

2 Example preparation of "stock data directional crawler"

For debugging convenience, use the traceback library

Overall framework

Insert picture description here
getStockList(): get the list of stocks
getStockInfo(): get the information of a single stock

getHTMLText()

Insert picture description here

getStockList()

Eastern Fortune.com source code:
Insert picture description here
all stored in, as long as it is parsed, there will be stock codes (the last few digits of href)
using regular expressions, but not all hrefs in them will meet the conditions, so try can be used …Except

regular expression is the stock code of Shenzhen or Shanghai, starting with s, followed by h or z, and then 6 digits

getStockInfo()

The source code of Baidu stock

Insert picture description here

First, send a request to each stock.
Use try...except to ensure that the returned page is normal.
All stock information is encapsulated in

Under class='stock-bets', use the split function to obtain the complete part of the corresponding
stock name with the stock name in class='bets-name'
Insert picture description here

, because some names are also associated with other identifiers, using spaces which was taken after the separation points out part 0
then, the information in all of the stock

key

value

Complete code

import requests
from bs4 import BeautifulSoup
import traceback
import re
 
def getHTMLText(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
 
def getStockList(lst, stockURL):
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser') 
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue
 
def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html=="":
                continue
            infoDict = {
    
    }
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div',attrs={
    
    'class':'stock-bets'})
 
            name = stockInfo.find_all(attrs={
    
    'class':'bets-name'})[0]
            infoDict.update({
    
    '股票名称': name.text.split()[0]})
             
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
             
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write( str(infoDict) + '\n' )
        except:
            traceback.print_exc()
            continue
 
def main():
    stock_list_url = 'https://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist=[]
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)
 
main()

This is not good either. The website has been revised and cannot be crawled out. The importance of keeping up with the times

3 Example optimization of "stock data directional crawler"

Improve user experience, but as long as requests and bs4 are used, the speed will not increase

3.1 Speed improvement: optimization of code recognition

Insert picture description here
Manually obtain the coding method.
Modified to:

The code of Oriental Fortune.com is'GB2312'.

Baidu stock adopts'utf-8', and it is not modified.

3.2 Experience improvement: add dynamic progress prompt

Insert picture description here
There are a lot of crawling pages, the progress is displayed dynamically, and the progress bar that does not wrap dynamically is added.

Add 1 count variable
No line breaks, using the escape character'\r', can bring the last cursor of the string we printed to the head of the current line, and the previous

'\r' will be overwritten in the next printing. Idle is forbidden Dropped, so use the command line

Python web crawler 3 (the examples are not very applicable)

The actual combat of web crawlers

Guidance

1 Getting started with the Re (regular expression) library

1.1 The concept of regular expressions

1.2 Regular expression syntax

1.3 Basic use of Re library

1.3.1 Representation type of regular expression-native string type

1.3.2 Main functions of Re library

re.search()

re.match()

re.findall()

re.split()

re.finditer ()

re.compiler()

1.4 match object of Re library

1.4.1 Properties of the Match object

1.4.2 Methods of Match Object

1.5 Greedy matching and minimum matching of the Re library

1.6 Summary

Example 2 "Taobao Commodity Price Comparison Targeted Crawler" (requests+re)

1 "Taobao commodity price comparison directional crawler" example introduction

2 "Taobao commodity price comparison directional crawler" example compilation

Overall structure

getHTMLText()

parsePage ()

printGoodsList()

3 summary

Example 3 "Stock Data Oriented Crawler" (requests+bs4+re)

1 Introduction to the "Stock Data Targeted Crawler" example

2 Example preparation of "stock data directional crawler"

Overall framework

getHTMLText()

getStockList()

getStockInfo()

3 Example optimization of "stock data directional crawler"

3.1 Speed improvement: optimization of code recognition

3.2 Experience improvement: add dynamic progress prompt

Guess you like

Python web crawler 3 (the examples are not very applicable)

The actual combat of web crawlers

Guidance

1 Getting started with the Re (regular expression) library

1.1 The concept of regular expressions

1.2 Regular expression syntax

1.3 Basic use of Re library

1.3.1 Representation type of regular expression-native string type

1.3.2 Main functions of Re library

re.search()

re.match()

re.findall()

re.split()

re.finditer ()

re.compiler()

1.4 match object of Re library

1.4.1 Properties of the Match object

1.4.2 Methods of Match Object

1.5 Greedy matching and minimum matching of the Re library

1.6 Summary

Example 2 "Taobao Commodity Price Comparison Targeted Crawler" (requests+re)

1 "Taobao commodity price comparison directional crawler" example introduction

2 "Taobao commodity price comparison directional crawler" example compilation

Overall structure

getHTMLText()

parsePage ()

printGoodsList()

3 summary

Example 3 "Stock Data Oriented Crawler" (requests+bs4+re)

1 Introduction to the "Stock Data Targeted Crawler" example

2 Example preparation of "stock data directional crawler"

Overall framework

getHTMLText()

getStockList()

getStockInfo()

3 Example optimization of "stock data directional crawler"

3.1 Speed ​​improvement: optimization of code recognition

3.2 Experience improvement: add dynamic progress prompt

Guess you like

3.1 Speed improvement: optimization of code recognition