Building a web crawler with Python

Recently I have wanted to build my own personal website and move my blog content over from CSDN, so I decided to write a web crawler in Python that automatically downloads my CSDN articles, converts their format, and posts them to my personal website.

This web crawler uses Python's requests, BeautifulSoup, and Selenium libraries.
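If they are not installed yet, they can be installed with pip, for example with pip install requests beautifulsoup4 selenium (the PyPI package for BeautifulSoup is named beautifulsoup4).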

Because Selenium drives a real browser, we need to install the corresponding driver. Both Chrome and Firefox have drivers available. Since I use Firefox, I downloaded geckodriver from the releases page of the mozilla/geckodriver project on GitHub: https://github.com/mozilla/geckodriver/releases

Then we can use Selenium to load the website, as in the following code:

from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium import webdriver
import time

# Point Selenium at the local Firefox binary and run it headless.
options = Options()
options.binary_location = r'C:\Program Files\Mozilla Firefox\firefox.exe'
options.add_argument("--headless")
options.add_argument("--no-sandbox")
# The geckodriver executable downloaded from the releases page above.
service = Service(executable_path="c:/software/geckodriver.exe")
browser = webdriver.Firefox(service=service, options=options)
# Open the blog's article list page.
browser.get("https://blog.csdn.net/gzroy?type=blog")

The options in the code set the --headless argument, which means no browser window will be opened. If you want to watch Selenium operate the browser, remove this argument. After the browser has loaded the page, we can read the list of articles and their URLs:

articles = []
articles_num = 0
while True:
    # Collect every article link currently present on the page.
    elements = browser.find_elements(By.TAG_NAME, 'a')
    for item in elements:
        href = item.get_attribute("href")
        if isinstance(href, str):
            if href.startswith("https://blog.csdn.net/gzroy/article/details") and href not in articles:
                articles.append(href)
    # Stop once scrolling no longer reveals any new articles.
    if len(articles) <= articles_num:
        break
    # On the first pass, scroll to the middle of the page first,
    # otherwise the lazily loaded articles do not seem to appear.
    if articles_num == 0:
        browser.execute_script('window.scrollTo(0,document.body.scrollHeight/2)')
        time.sleep(0.5)
    articles_num = len(articles)
    # Scroll to the bottom and give the page time to load more articles.
    browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    time.sleep(2)

The CSDN article list is not loaded all at once; more articles are loaded as the page is scrolled, so we need to execute JavaScript to simulate scrolling to the bottom of the page. One odd thing is that when scrolling straight to the bottom the first time, the page does not seem to load the hidden articles, so I changed the logic slightly: first scroll to the middle of the page, then scroll to the bottom. This may be related to some specific behavior of the site.
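The fixed time.sleep() calls above are a simple way to wait for the lazily loaded content. An explicit wait is a possible alternative; here is a minimal sketch, assuming the same browser object and article URL prefix as above:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one article link to appear,
# instead of sleeping for a fixed amount of time.
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located(
        (By.CSS_SELECTOR, "a[href^='https://blog.csdn.net/gzroy/article/details']")
    )
)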

After obtaining the article list, we can visit each article in turn, read its content, and save it to a database.

Here I chose sqlite3, a lightweight database. I use requests to fetch each article page and BeautifulSoup to parse it, extracting the article's title, creation time, and content.

The first step is to connect to the database and create a table for saving the articles. The code is as follows:

import sqlite3

# Connect to (or create) the local database file and create the article table.
conn = sqlite3.connect('blog.db')
c = conn.cursor()
c.execute('''CREATE TABLE ARTICLE
    (ID INT PRIMARY KEY     NOT NULL,
    TITLE          TEXT    NOT NULL,
    CREATETIME     TEXT     NOT NULL,
    CONTENT        TEXT);''')
conn.commit()

Then build a loop that visits each article in the list, extracts the relevant content, and saves it to the database:

import base64
import requests
from bs4 import BeautifulSoup

id = 0
for item in articles[::-1]:
    # Fetch the article page and parse it.
    article = requests.get(item, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(article.text, 'html.parser')
    # Extract the title, the HTML of the article body, and the creation time.
    title = soup.find('h1', {'id': 'articleContentId'}).text
    title_b64 = base64.b64encode(title.encode()).decode()
    content = soup.find('div', {'id': 'content_views'}).encode_contents()
    content_b64 = base64.b64encode(content).decode()
    createTime = soup.find('span', {'class': 'time'}).text[2:-3]
    # Insert the record; the base64-encoded fields contain no quotes,
    # so they can be embedded in the SQL string safely.
    sqlstr = "INSERT INTO ARTICLE (ID,TITLE,CREATETIME,CONTENT) VALUES (" + \
        str(id) + ",'" + title_b64 + "','" + createTime + "','" + content_b64 + "')"
    c.execute(sqlstr)
    id += 1
conn.commit()
conn.close()

Here, after extracting the title and content of the article, I first base64-encode them and then save them to the database. The main reason is to prevent special characters in the text from breaking the SQL statement.
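When the articles are read back out later (for example to render them on the new blog), the base64 step has to be reversed. A minimal sketch, assuming the blog.db schema created above:

import base64
import sqlite3

conn = sqlite3.connect('blog.db')
c = conn.cursor()
for row in c.execute("SELECT ID, TITLE, CREATETIME, CONTENT FROM ARTICLE ORDER BY ID"):
    # Reverse the base64 encoding applied when the article was stored.
    title = base64.b64decode(row[1]).decode()
    content = base64.b64decode(row[3]).decode()
    print(row[0], row[2], title)
conn.close()

As an aside, sqlite3's parameterized queries (passing the values as a tuple to c.execute with ? placeholders) would also avoid the special-character problem without the base64 round trip.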

So far, the crawler has processed all the articles. In the next step I will build a blog site and display the saved articles on it. To be continued...
