Crawling the CSDN Blog with Python

"Speak Python briefly", select "Top/Star Official Account"
and dry welfare goods will be delivered as soon as possible!
Python crawling CSND blog

1. Knowledge requirements

  • Basic knowledge of Python (lists and tuples)
  • The urllib module: timeout settings, and simulating HTTP GET and POST requests
  • Exception handling and browser-camouflage techniques in practice
    If you have forgotten any of this, review the relevant topics before continuing.
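As a warm-up for the prerequisites above, here is a minimal sketch of an HTTP GET request with a timeout and a browser-style User-Agent header (the "browser camouflage" technique). The header string is illustrative, not taken from the article.

```python
import urllib.request

# Build a request that looks like it comes from a browser
req = urllib.request.Request(
    "https://blog.csdn.net/",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
try:
    # timeout=10 aborts the request if the server does not respond in 10 s
    with urllib.request.urlopen(req, timeout=10) as resp:
        page = resp.read().decode("utf-8", "ignore")
        print("fetched", len(page), "characters")
except OSError as e:  # URLError and socket timeouts are both OSError subclasses
    print("request failed:", e)
```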

2. Crawling the CSDN blog homepage

Purpose: Crawl all news links on the CSDN blog homepage and download the linked pages to a local folder.

(1) Open the CSDN blog homepage (https://blog.csdn.net/), right-click, choose "View page source", then press Ctrl+F on the source page to open a search box.
(2) Copy a few news titles from the homepage, search for them in the source, and observe how the links around the matches are structured. Try several titles.


(3) After trying a few titles, you will find that the links are identical except for one segment (the article-specific part). So we can use the regular expression <a.*?href="(.*?)" target="_blank" to capture the URLs.
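To see what this pattern captures, here is a small self-contained demo run against a sample snippet (invented here for illustration) shaped like the homepage HTML:

```python
import re

# Two anchor tags in the shape the article describes
html = (
    '<a class="title" href="https://blog.csdn.net/u1/article/details/111" target="_blank">Post A</a>'
    '<a href="https://blog.csdn.net/u2/article/details/222" target="_blank">Post B</a>'
)
# Non-greedy .*? keeps each match inside a single <a> tag;
# the (.*?) group captures just the href value
pat = '<a.*?href="(.*?)" target="_blank"'
links = re.compile(pat).findall(html)
print(links)
# → ['https://blog.csdn.net/u1/article/details/111', 'https://blog.csdn.net/u2/article/details/222']
```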
(4) Crawl the information from the CSDN homepage

# Import the required modules
import re
import urllib.request
import urllib.error

# The site to crawl
url = "https://blog.csdn.net/"
# Fetch the current page content ('ignore' skips bytes that fail to decode and keeps going)
page = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
# Set the regular expression
pat = '<a.*?href="(.*?)" target="_blank"'
# Match the links we want from the page content
links = re.compile(pat).findall(page)
print(links[:12])
# Save each crawled news page locally
for i in range(0, len(links)):
    # Use exception handling so a single failure does not stop the whole crawl
    try:
        urllib.request.urlretrieve(links[i], "D:\\python\\news\\"+str(i)+".html")
    except urllib.error.HTTPError as e:
        if hasattr(e, 'code'):
            print(e.code)
        if hasattr(e, 'reason'):
            print(e.reason)
print('Crawl finished!')

(5) Run the code above and you will hit an error: the crawl stops partway through. Analyzing the error shows that some of the matched strings are not URLs at all.

(6) So we add the statement links = [link for link in links if link[:4]=='http'] to filter out the strings that are not URLs. The final code is as follows:

# Import the required modules
import re
import urllib.request
import urllib.error

# The site to crawl
url = "https://blog.csdn.net/"
# Fetch the current page content ('ignore' skips bytes that fail to decode and keeps going)
page = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
# Set the regular expression
pat = '<a.*?href="(.*?)" target="_blank"'
# Match the links we want from the page content
links = re.compile(pat).findall(page)
print(len(links))
# The crawl hit entries like <a href="/nav/ai" target="_blank">, which are not full URLs, so filter them out
links = [link for link in links if link[:4]=='http']
print(len(links))
# Save each crawled news page locally (the target folder must already exist)
for i in range(0, len(links)):
    # Use exception handling so a single failure does not stop the whole crawl
    try:
        urllib.request.urlretrieve(links[i], "D:\\python\\news\\"+str(i)+".html")
    except urllib.error.HTTPError as e:
        if hasattr(e, 'code'):
            print(e.code)
        if hasattr(e, 'reason'):
            print(e.reason)
print('Crawl finished!')

(7) Run the program and you can see that 21 non-URL strings were filtered out. The pages numbered 0 through 122 were all downloaded, so every link I obtained was crawled successfully!
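The filter from step (6) can be sanity-checked on sample data. The snippet below also sketches an alternative, not used in the article's code: resolving relative paths such as /nav/ai against the site root with urljoin instead of discarding them. The sample links are invented for illustration.

```python
from urllib.parse import urljoin

raw = ["https://blog.csdn.net/u1/article/details/111", "/nav/ai"]

# The article's filter: keep only strings that start with "http"
kept = [link for link in raw if link[:4] == 'http']
print(kept)      # → ['https://blog.csdn.net/u1/article/details/111']

# Alternative: turn "/nav/ai" into a full URL instead of dropping it
absolute = [urljoin("https://blog.csdn.net/", link) for link in raw]
print(absolute)  # → ['https://blog.csdn.net/u1/article/details/111', 'https://blog.csdn.net/nav/ai']
```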


Staying down-to-earth makes for a better life. That's the end of this article.

Origin blog.51cto.com/15069482/2578576