[Python study notes] Web scraping

To scrape web pages with Python, we need to learn the following four modules:

Package          Effect
webbrowser       Opens a browser to a specified page.
requests         Downloads files and web pages from the Internet.
Beautiful Soup   Parses HTML, the format that web pages are written in.
selenium         Launches and controls a web browser; it can fill in forms and simulate mouse clicks.

Small project using the webbrowser module: bilibiliSearch.py

The webbrowser module's open() function can launch a new browser window or tab and open a specified URL.

import webbrowser
webbrowser.open("https://bilibili.com")

Running the example above opens a new browser tab pointing at Bilibili. That, however, is about the only thing the webbrowser module can do.

Suppose we want the following feature: search Bilibili for text taken either from the command line or from the clipboard. Our program needs to:

  • Get the text to search for from the command-line arguments or from the clipboard;
  • Open a web browser pointing at the search results.

Work out the URL of a Bilibili search

First search Bilibili manually for "Sun Xiaochuan Kichiku" and look at the address bar: the URL is "https://search.bilibili.com/all?keyword=Sun%20Xiaochuan%20Kichiku". So the text to search for simply goes after keyword=, and the spaces between multiple keywords appear as %20 in the URL.
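
Pasting raw text after keyword= usually works because the browser encodes it for us, but the keywords can also be percent-encoded explicitly. A minimal sketch using the standard library's urllib.parse.quote (not part of the original notes):

>>> from urllib.parse import quote
>>> quote('Sun Xiaochuan Kichiku')
'Sun%20Xiaochuan%20Kichiku'
>>> 'https://search.bilibili.com/all?keyword=' + quote('Sun Xiaochuan Kichiku')
'https://search.bilibili.com/all?keyword=Sun%20Xiaochuan%20Kichiku'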

Processing command line arguments

To get the text to search for from the command-line arguments, import the sys module and read sys.argv.
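
For example, if the script is launched as python bilibiliSearch.py Sun Xiaochuan Kichiku, then inside the script sys.argv holds the script name followed by the arguments (a quick illustration, not part of the final program):

import sys
print(sys.argv)      # ['bilibiliSearch.py', 'Sun', 'Xiaochuan', 'Kichiku']
print(sys.argv[1:])  # just the search words: ['Sun', 'Xiaochuan', 'Kichiku']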

Handling the clipboard contents

Use the pyperclip package's pyperclip.paste() function to get the contents of the clipboard.
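
A quick interactive check (assuming pyperclip is installed, e.g. with pip install pyperclip):

>>> import pyperclip
>>> pyperclip.copy('Sun Xiaochuan Kichiku')   # put some text on the clipboard
>>> pyperclip.paste()                         # read it back
'Sun Xiaochuan Kichiku'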

Putting it together, the entire program is as follows:

import webbrowser, sys, pyperclip

if len(sys.argv) > 1:
    # Get address from command line.
    address = ' '.join(sys.argv[1:])
else:
    # Get address from clipboard.
    address = pyperclip.paste()

webbrowser.open("https://search.bilibili.com/all?keyword="+address)
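
To use it, run something like python bilibiliSearch.py Sun Xiaochuan Kichiku from the command line, or copy the text to search for and run the script with no arguments so it reads the clipboard instead.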

Downloading files from the Web with the requests module

The requests.get() function takes a URL string to download and returns a Response object containing the web server's response to the request.

>>> import requests
>>> res = requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt')
>>> type(res)
<class 'requests.models.Response'>
>>> res.status_code == requests.codes.ok
True
>>> len(res.text)
179378
>>> requests.codes.ok
200
>>> print(res.text[:300])    
The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare


*******************************************************************
THIS EBOOK WAS ONE OF PROJECT GUTENBERG'S EARLY FILES PRODUCED AT A
TIME WHEN PROOFING METHODS AND TOOLS WERE NOT WELL DEVELOPED. THERE
IS AN IMPROVED

The Response object has a number of useful attributes: status_code holds the HTTP status code, which tells us whether the request succeeded, and text holds the text of the page. It is worth noting that a successful request has status code 200; in general, other codes indicate some kind of failure.

When the request fails, calling the raise_for_status() method throws an exception and the program stops. If we only want the error to be reported, without stopping the program, we can wrap the statement in try and except.

>>> res = requests.get("http://www.donotexists777.com")
>>> res.status_code
404
>>> res.raise_for_status()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Anaconda3\envs\mlbook\lib\site-packages\requests\models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://miwifi.com/diagnosis/index.html
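
A minimal sketch of the try/except wrapping mentioned above (the URL is just the placeholder from the previous example); requests.exceptions.RequestException also covers connection errors raised by requests.get() itself:

import requests

try:
    res = requests.get("http://www.donotexists777.com")
    res.raise_for_status()
except requests.exceptions.RequestException as exc:
    print('There was a problem: %s' % exc)
# Execution continues here even if the request failed.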

Save the downloaded file to your hard drive

>>> import requests
>>> res = requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt')
>>> res.raise_for_status()
>>> playFile = open('RomeoAndJuliet.txt', 'wb')
>>> for chunk in res.iter_content(100000):
...     playFile.write(chunk)
...
100000
79380
>>> playFile.close()

We use the standard open() and write() calls to save the web page to a local file. The 'wb' argument opens the file for writing in binary mode, which lets us write the text while preserving its Unicode encoding.

To reduce memory usage, we use the iter_content() method to write the content to the local file in chunks rather than all at once.
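
The same download can also be written with a with statement so the file is closed automatically, and with stream=True so the body is only pulled from the network as iter_content() consumes it. A minimal sketch using the same Gutenberg URL:

import requests

res = requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt', stream=True)
res.raise_for_status()
with open('RomeoAndJuliet.txt', 'wb') as playFile:
    for chunk in res.iter_content(chunk_size=100000):
        playFile.write(chunk)
# No explicit close() needed: the with block closes the file.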

Using the BeautifulSoup module to parse HTML

The reason not to parse HTML with regular expressions is that HTML can be written in many different ways and still be valid, and trying to capture all of those variations with regular expressions is tedious and error-prone. The BeautifulSoup module is much less likely to lead to such defects.

Creating a BeautifulSoup object from HTML

The bs4.BeautifulSoup() function expects a string containing the HTML to be parsed and returns a BeautifulSoup object.

>>> import requests, bs4
>>> res = requests.get('http://www.baidu.com')
>>> res.raise_for_status()
>>> baiduSoup = bs4.BeautifulSoup(res.text, 'html.parser')
>>> type(baiduSoup)
<class 'bs4.BeautifulSoup'>

Using the select() method to find elements

To find elements, call the select() method and pass it a CSS selector as a string; it returns the matching page elements.

<!-- This is the example.html example file. -->

<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://
inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>

To try it out, save the code above as example.html in the current folder.

>>> from bs4 import BeautifulSoup
>>> exampleFile = open('./example.html')
>>> exampleSoup = BeautifulSoup(exampleFile.read(), features='html.parser')
>>> elems = exampleSoup.select('#author')
>>> type(elems)
<class 'list'>
>>> len(elems)
1
>>> type(elems[0])
<class 'bs4.element.Tag'>
>>> elems[0].getText()
'Al Sweigart'
>>> str(elems[0])
'<span id="author">Al Sweigart</span>'
>>> elems[0].attrs
{'id': 'author'}
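
A few more selector patterns against the same soup object (these lines are additions, not from the original session; the output assumes the example.html above):

>>> len(exampleSoup.select('p'))                  # all <p> elements
3
>>> exampleSoup.select('.slogan')[0].getText()    # elements with class="slogan"
'Learn Python the easy way!'
>>> exampleSoup.select('p strong')[0].getText()   # a <strong> inside a <p>
'Python'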

Small program

Scraping an anime ranking list

Some of the shows require logging in to view, so we close all browsers and then start Edge with remote debugging enabled from cmd:

"C:\Program Files (x86)\Microsoft\Edge Beta\Application\msedge.exe" --remote-debugging-port=8888

Next:

from selenium import webdriver
import time
"""
First, run:
"C:\Program Files (x86)\Microsoft\Edge Beta\Application\msedge.exe" --remote-debugging-port=8888
in the cmd.
"""
options = webdriver.ChromeOptions()
options.debugger_address = "127.0.0.1:" + '8888'
options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge Beta\Application\msedge.exe"
chrome_driver_binary = r"D:\APP\MicrosoftWebDriver.exe"
driver = webdriver.Chrome(chrome_driver_binary, chrome_options=options)
for i in range(1, 100):
    url = "https://bangumi.tv/anime/tag/%E6%90%9E%E7%AC%91?page=" + str(i)
    driver.get(url)
    time.sleep(5)  # Let the user actually see something!
    filepath = str(i) + '.html'
    with open(filepath, 'wb') as f:
        f.write(driver.page_source.encode("utf-8", "ignore"))
        print(filepath + " written successfully!")  # the with block closes the file for us

The code above downloads the listing pages to local files.

Next, parse them:

from bs4 import BeautifulSoup
import re

dateRegex = re.compile(r"(\d{4})\D+(\d{1,2})\D+(\d{1,2})")  # normalize the release date
with open('topAnime.txt', 'a', encoding="utf-8") as f:
    for i in range(1, 77):
        filepath = str(i) + '.html'
        soup = BeautifulSoup(open(filepath, encoding="utf-8"), 'html.parser')
        nameList = soup.find_all(name="h3")
        for name in nameList:
            link = "https://bangumi.tv" + name.contents[1]["href"]
            f.writelines('[' + name.contents[1].string.strip('\n') + ']' + '(' + link + ')')
            if len(name) >= 4:
                f.writelines("\t" + name.contents[3].string.strip('\n'))
            else:
                f.writelines("\tNone")
            for sibling in name.next_siblings:
                try:
                    if sibling.attrs["class"] == ['info', 'tip']:
                        # f.writelines("\tinfo: " + sibling.string.strip())
                        date = dateRegex.search(sibling.string.strip())
                        try:
                            f.writelines("\t" + date[1] + '-' + date[2] + '-' + date[3])
                        except TypeError:
                            continue
                    if sibling.attrs["class"] == ['rateInfo']:
                        try:
                            f.writelines("\t" + sibling.contents[3].string.strip('\n'))
                        except IndexError:
                            f.writelines("\t0")
                            continue
                except AttributeError:
                    continue
            f.writelines("\n")
# timeList = soup.find_all(attrs={"class": "info tip"})
# for time in timeList:
#     f.writelines(time.string)

The final output is not shown here.


Source: www.cnblogs.com/dereen/p/python_webScraping.html