Python web crawler, part 3 (the examples no longer work well)

Guidance


1 Getting started with the Re (regular expression) library

1.1 The concept of regular expressions

Listing every string in a group one by one is too cumbersome, so regular expressions are used: a single regular expression can express a whole group of strings.
Only after compilation do the resulting features correspond to a group of strings; before compilation, a regular expression is just a single string that conforms to the regular expression syntax.

1.2 Regular expression syntax

"." matches any single character. Combined with character classes such as [0-9], quantifiers such as *, + and {m,n}, and anchors such as ^ and $, these operators let a short expression stand for many possible strings.

The first attempt does not consider the value range of each segment; it only uses the "." between the segments as a separator.
The second restricts each segment to 1 to 3 digits.
Neither of the two is precise enough, because each segment of an IP address must lie in the range 0 to 255.
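As a sketch of the difference (the exact patterns from the original slides are not reproduced here, so these are representative versions), compare a rough and a stricter IP-address pattern:

import re

# Rough pattern: four digit groups separated by '.', with no limit
# on the value of each group.
rough = r'\d+\.\d+\.\d+\.\d+'

# Stricter pattern: each group must be 0-99, 100-199, 200-249 or 250-255.
precise = r'(([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])'

print(bool(re.match(rough, '299.300.1.1')))    # True, although this is not a valid IP
print(bool(re.match(precise, '299.300.1.1')))  # False
print(bool(re.match(precise, '192.168.1.1')))  # True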

1.3 Basic use of Re library


1.3.1 How regular expressions are represented: the raw string type

In a raw string (written as r'text'), the backslash "\" is not treated as an escape character. Regular expressions are therefore usually written as raw strings, for example r'[1-9]\d{5}' rather than '[1-9]\\d{5}'.
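A small sketch of the difference, using the postal-code pattern from the examples below:

import re

# Ordinary string: every backslash must itself be escaped.
print(re.search('[1-9]\\d{5}', 'BIT 100081').group(0))   # '100081'

# Raw string: the backslash is written once and is not an escape character.
print(re.search(r'[1-9]\d{5}', 'BIT 100081').group(0))   # '100081'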

1.3.2 Main functions of Re library

re.search(): scan a string and return the first place where the regular expression matches
re.match(): match the regular expression only at the beginning of the string
re.findall(): return a list of all substrings of the string that match the regular expression

re.search()

In regular expressions, "." matches any character except "\n".
Example: searching for a Chinese postal code (a pattern such as r'[1-9]\d{5}') in the string 'BIT 100081'.
Why write 'BIT' at all? It has no particular meaning; it is only there to show that search() does not have to match from the beginning of the string.
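A minimal runnable example of re.search(), using the postal-code pattern assumed above:

import re

match = re.search(r'[1-9]\d{5}', 'BIT 100081')
if match:                    # always check that a match was found
    print(match.group(0))    # '100081'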

re.match()

Here no match is found: match() only matches at the beginning of the string, so it cannot match the postal code in 'BIT 100081'.
If you use the result without first checking whether a match was found, an error is raised.
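A small sketch showing both the failed match and the guard:

import re

m = re.match(r'[1-9]\d{5}', 'BIT 100081')
print(m)                 # None: the string does not start with the pattern
if m:                    # guard before using the result;
    print(m.group(0))    # calling m.group(0) on None would raise an error

m = re.match(r'[1-9]\d{5}', '100081 BIT')
if m:
    print(m.group(0))    # '100081'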

re.findall()

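findall() returns every match as a list of strings. A minimal example reusing the postal-code pattern (the test string is illustrative):

import re

ls = re.findall(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(ls)    # ['100081', '100084']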

re.split()

split() splits a string at every match of the regular expression and returns the pieces as a list. In the second call, maxsplit=1 means only the first match is used for splitting; the rest of the string is returned unchanged as the last element.
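A sketch of both calls (the test string is illustrative):

import re

# Split at every match of the pattern.
print(re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084'))
# ['BIT', ' TSU', '']

# maxsplit=1: only the first match is used for splitting.
print(re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=1))
# ['BIT', ' TSU100084']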

re.finditer()

finditer() returns each match iteratively, so every result can be retrieved and processed separately.
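A short example that prints each match in turn:

import re

for m in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'):
    print(m.group(0))    # '100081', then '100084'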

re.compile()

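compile() turns a string that follows the regular expression syntax into a regular expression object whose methods (search, match, findall, and so on) can be called repeatedly. A minimal sketch:

import re

pat = re.compile(r'[1-9]\d{5}')              # compile once, reuse many times
print(pat.search('BIT 100081').group(0))     # '100081'
print(pat.findall('BIT100081 TSU100084'))    # ['100081', '100084']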

1.4 The Match object of the Re library


1.4.1 Properties of the Match object


1.4.2 Methods of Match Object

The .re attribute prints as re.compile(...), showing that the pattern is stored as a compiled regular expression object.
The .pos and .endpos attributes give the start and end positions of the text in which the regular expression searches.

A Match object holds the result of a single match; to get every match, use finditer() and handle each Match object in turn.
.start() and .end() give the start and end positions of the matched substring in the original string, and .span() returns the two together as a tuple (start, end).
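A sketch that exercises the main attributes and methods on the postal-code example (the index values in the comments follow from the 10-character test string):

import re

m = re.search(r'[1-9]\d{5}', 'BIT 100081')

# attributes
print(m.string)    # 'BIT 100081' : the text that was searched
print(m.re)        # re.compile('[1-9]\\d{5}') : the pattern used
print(m.pos)       # 0  : start of the region that was searched
print(m.endpos)    # 10 : end of the region that was searched

# methods
print(m.group(0))  # '100081' : the matched substring
print(m.start())   # 4  : index where the match starts
print(m.end())     # 10 : index where the match ends
print(m.span())    # (4, 10) : both positions as a tuple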

1.5 Greedy matching and minimal matching in the Re library

PY.*N matches a string that starts with "PY", ends with "N", and has any characters in between. By default the Re library performs greedy matching and returns the longest possible match; adding "?" after the quantifier (PY.*?N) switches to minimal matching and returns the shortest match.
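For instance (the test string is illustrative):

import re

# Greedy (default): the longest possible match is returned.
print(re.search(r'PY.*N', 'PYANBNCNDN').group(0))    # 'PYANBNCNDN'

# Minimal matching: '?' after the quantifier returns the shortest match.
print(re.search(r'PY.*?N', 'PYANBNCNDN').group(0))   # 'PYAN'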

1.6 Summary


Example 2 "Taobao product price comparison directional crawler" (requests+re)

1 Introduction to the "Taobao product price comparison directional crawler" example

This example only fetches a few pages of search results, so it does not put any meaningful load on Taobao's servers.

2 Writing the "Taobao product price comparison directional crawler" example

Overall structure

depth = 2 means crawl 2 pages of search results (the full code below uses depth = 3).
Each page is fetched inside a try/except block; if crawling one page raises an error, the except branch runs continue, so the failed page is skipped and the next page is crawled.

getHTMLText()


parsePage()

Because Taobao embeds the price data in a script on the page rather than in ordinary HTML tags, a plain regular-expression search is enough and BeautifulSoup is not needed.
(Note: Taobao search now pops up a login page, so this crawler may no longer work; the following refers to the page source shown in the video.)
In that source, the price appears as:

  • "view_price": "139.9"

and the product name appears as:

  • "raw_title": "xxxxx"

In the pattern, the backslash "\" acts as an escape character. The regular expression for the price is "view_price":"[\d.]*".
Each string returned by findall() is then split on ":"; the part after the colon still carries its double quotes, and eval() is used to strip them off.
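A minimal sketch of that parsing step on a single matched fragment (the value is illustrative):

# One element returned by findall() looks like this:
s = '"view_price":"139.90"'

part = s.split(':')[1]   # '"139.90"'  (the part after the colon)
price = eval(part)       # eval() strips the surrounding double quotes
print(price)             # 139.90 (as the string '139.90')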

printGoodsList()

The format template "{:4}\t{:8}\t{:16}" means the first column is printed with a width of 4 characters, the second with a width of 8, and the third with a width of 16, separated by tabs.
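For reference, this is what the template from printGoodsList() produces:

tplt = "{:4}\t{:8}\t{:16}"
print(tplt.format("序号", "价格", "商品名称"))   # header row
print(tplt.format(1, 139.9, "书包"))             # a data row: index, price, name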

3 Summary

Source code

import requests
import re

def getHTMLText(url):
    """Fetch a page and return its text, or "" on any error."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def parsePage(ilt, html):
    """Extract (price, title) pairs from the page source into the list ilt."""
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            # take the part after ':' and let eval() strip the double quotes
            price = eval(plt[i].split(':')[1])
            title = eval(tlt[i].split(':')[1])
            ilt.append([price, title])
    except:
        print("")

def printGoodsList(ilt):
    """Print the results as a numbered table: index, price, product name."""
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("序号", "价格", "商品名称"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))

def main():
    goods = '书包'          # search keyword ("backpack")
    depth = 3               # number of result pages to crawl
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44*i)   # 's' is the result offset for page i+1
            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue        # skip a page that fails and move on
    printGoodsList(infoList)

main()

This does not seem to work any more; I was not able to crawl anything.

Example 3 "Stock data directional crawler" (requests+bs4+re)

1 Introduction to the "stock data directional crawler" example

Looking at the page source, Baidu Stocks is the more suitable source for the details of a single stock, because the information appears directly in the HTML.
However, Baidu Stocks does not list many stocks on one page, so the Eastmoney (Oriental Fortune) site is used to obtain the list of stock codes.

Following the way the page stores its data, each piece of information is a label paired with a value, so key-value pairs fit naturally. A dictionary, the data type that maintains key-value pairs, is used to hold the information of each stock, and the dictionaries of all stocks are then combined into the final output.

2 Writing the "stock data directional crawler" example

For debugging convenience, use the traceback library

Overall framework

getStockList(): obtain the list of stock codes
getStockInfo(): obtain the information of each individual stock and save it to the output file

getHTMLText()


getStockList()

Eastmoney (Oriental Fortune) page source:
The stock codes are all stored in the <a> tags; once the page is parsed, each code can be read from the last few characters of the href attribute using a regular expression. Not every href on the page satisfies this pattern, so the lookup is wrapped in try...except and non-matching tags are simply skipped.
The regular expression matches a Shanghai or Shenzhen stock code: it starts with "s", followed by "h" or "z", followed by 6 digits.
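A small sketch of what that extraction does (the href value is a hypothetical example of the link format):

import re

# A hypothetical href of the kind found in the <a> tags of the list page:
href = 'http://quote.eastmoney.com/sh600000.html'

print(re.findall(r"[s][hz]\d{6}", href))   # ['sh600000']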

getStockInfo()

Baidu Stocks page source:

First, a request is sent for each stock's page, wrapped in try...except so that an abnormal page does not break the loop.
All of a stock's information is contained in the element with class='stock-bets'. The stock name sits in the element with class='bets-name'; because other identifiers may follow the name, the text is split on whitespace and part [0] is taken as the clean name.
The remaining information is stored as key (from the <dt> tags) and value (from the <dd> tags) pairs.

Complete code

import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url):
    """Fetch a page and return its text, or "" on any error."""
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    """Collect stock codes (e.g. sh600000) from the hrefs of the list page."""
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue    # href missing or not a stock link: skip it

def getStockInfo(lst, stockURL, fpath):
    """Fetch each stock's page, parse its key-value data and append it to fpath."""
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

            # the stock name is the first whitespace-separated token
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名称': name.text.split()[0]})

            # the remaining fields are <dt> (key) / <dd> (value) pairs
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except:
            traceback.print_exc()
            continue

def main():
    stock_list_url = 'https://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()

This one does not work any more either: the websites have been redesigned, so nothing can be crawled. It shows how important it is to keep a crawler up to date with the sites it targets.

3 Optimizing the "stock data directional crawler" example

These changes improve the user experience; as long as requests and bs4 are used, the crawling speed itself will not increase much.

3.1 Speed improvement: optimizing encoding identification

Instead of letting r.apparent_encoding analyse every page (which is slow), the encoding is obtained manually ahead of time and set directly.
The Eastmoney page uses 'GB2312'; Baidu Stocks uses 'utf-8', so for those pages nothing needs to change.
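A sketch of the kind of change described, assuming a code parameter is added to getHTMLText() (the parameter name is my choice, not reproduced from the original screenshot):

import requests

def getHTMLText(url, code="utf-8"):
    """Fetch a page, using a caller-supplied encoding instead of
    analysing the whole document with apparent_encoding."""
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = code      # skip the slow apparent_encoding step
        return r.text
    except:
        return ""

# The Eastmoney list page would be fetched with code="GB2312";
# Baidu Stocks pages keep the default "utf-8".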

3.2 Experience improvement: adding a dynamic progress indicator

Many pages are crawled, so the progress is displayed dynamically, with a progress indicator that does not wrap onto new lines (see the sketch after this list).

  • Add a count variable that is incremented for each stock processed.
  • Print without a line break, using the escape character '\r' to move the cursor back to the start of the current line, so the previous output is overwritten by the next print. IDLE suppresses '\r', so run the script from the command line to see the effect.
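A standalone sketch of the technique (in the crawler the same print() call would go inside the getStockInfo() loop; the wording of the message is my own):

import time

total = 50
for count in range(1, total + 1):
    # '\r' returns the cursor to the start of the line, so each new
    # percentage overwrites the previous one; end="" suppresses the newline.
    print("\rProgress: {:.2f}%".format(count * 100 / total), end="")
    time.sleep(0.05)      # stands in for the real per-stock work
print()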


Origin blog.csdn.net/qq_42713936/article/details/105897504